On Improved Loss Estimation
for Shrinkage Estimators 

Dominique Fourdrinier and Martin T. Wells 



Abstract. Let $X$ be a random vector with distribution $P_\theta$ where $\theta$ is
an unknown parameter. When estimating $\theta$ by some estimator $\varphi(X)$
under a loss function $L(\theta,\varphi)$, classical decision theory advocates that
such a decision rule should be used if it has suitable properties with
respect to the frequentist risk $R(\theta,\varphi)$. However, after having observed
$X = x$, instances arise in practice in which $\varphi$ is to be accompanied by
an assessment of its loss, $L(\theta,\varphi(x))$, which is unobservable since $\theta$ is
unknown. A common approach to this assessment is to consider estimation
of $L(\theta,\varphi(x))$ by an estimator $\delta$, called a loss estimator. We
present an expository development of loss estimation with substantial
emphasis on the setting where the distributional context is normal and
its extension to the case where the underlying distribution is spherically
symmetric. Our overview covers improved loss estimators for least
squares but primarily focuses on shrinkage estimators. Bayes estimation
is also considered and comparisons are made with unbiased estimation.

Key words and phrases: Conditional inference, linear model, loss esti- 
mation, quadratic loss, risk function, robustness, shrinkage estimation, 
spherical symmetry, SURE, unbiased estimator of loss, uniform distri- 
bution on a sphere. 



1. INTRODUCTION 

Suppose $X$ is an observable from a distribution $P_\theta$
parameterized by an unknown parameter $\theta$. In classical
decision theory, it is usual, after selecting an estimation
procedure $\varphi(X)$ of $\theta$, to evaluate it through
a loss criterion, $L(\theta,\varphi(X))$, which represents the cost
incurred by the estimate $\varphi(X)$ when the unknown
parameter equals $\theta$. In the long run, as it depends
on the particular value of $X$, this loss cannot be
appropriate to assess the performance of the estimator
$\varphi$. Indeed, to be valid (in the frequentist sense),
a global evaluation of such a statistical procedure
should be based on averages over all the possible
observations. Consequently, it is common to report
the risk $R(\theta,\varphi) = E_\theta[L(\theta,\varphi(X))]$ as a measure of the
efficiency of $\varphi$ ($E_\theta$ denotes expectation with respect
to $P_\theta$). Thus we have at our disposal a long-run
performance of $\varphi(X)$ for each value of $\theta$. However,
although this notion of risk can effectively be used in
comparing $\varphi(X)$ with other estimators, it is
inaccessible since $\theta$ is unknown. The usual frequentist risk
assessment is the maximum risk $\bar{R}_\varphi = \sup_\theta R(\theta,\varphi)$.
By construction, this least favorable report of the
estimation procedure is non-data-dependent [as we
were guided by a global notion of accuracy of $\varphi(X)$].
However, there exist situations where the particular value
taken by the observation $X$ may influence
the judgment on a statistical procedure. A particularly
instructive example is given by the following
simple confidence interval estimation (which can be
viewed as a loss estimation problem). Assume that
the observable is a couple $(X_1,X_2)$ of independent
copies of a random variable $X$ satisfying, for $\theta \in \mathbb{R}$,

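$$P[X = \theta - 1] = P[X = \theta + 1] = \tfrac{1}{2}.$$

Then it is clear that the confidence interval for $\theta$
defined by

$$I(x_1,x_2) = \Bigl\{\theta \in \mathbb{R} : \Bigl|\frac{x_1 + x_2}{2} - \theta\Bigr| < \frac{1}{2}\Bigr\}$$

satisfies

$$1_{[\theta \in I(X_1,X_2)]} = 1_{[X_1 \neq X_2]},$$

so that it suffices to observe $(X_1,X_2)$ in order to
know exactly whether $I(X_1,X_2)$ contains $\theta$ or not.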
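
A small simulation makes the point concrete. The sketch below is only illustrative: the function names `covers` and `simulate`, the value $\theta = 3$ and the seed are assumptions made for the example and do not come from the text. It draws pairs $(X_1,X_2)$, checks whether $\theta$ lies in $I(X_1,X_2)$, and compares the coverage indicator with the observable event $X_1 \neq X_2$.

```python
import random


def covers(theta, x1, x2):
    """Return True when theta lies in I(x1, x2) = {t : |(x1 + x2)/2 - t| < 1/2}."""
    return abs((x1 + x2) / 2.0 - theta) < 0.5


def simulate(theta, n_draws, seed=0):
    """Return (coverage frequency, frequency with which coverage == [x1 != x2])."""
    rng = random.Random(seed)
    covered = 0
    agree = 0
    for _ in range(n_draws):
        # Each observation equals theta - 1 or theta + 1 with probability 1/2.
        x1 = theta + rng.choice((-1.0, 1.0))
        x2 = theta + rng.choice((-1.0, 1.0))
        hit = covers(theta, x1, x2)
        covered += hit
        agree += (hit == (x1 != x2))
    return covered / n_draws, agree / n_draws


coverage, agreement = simulate(theta=3.0, n_draws=20000)
print(f"unconditional coverage: {coverage:.3f}")
print(f"coverage indicator equals [X1 != X2] in a fraction {agreement:.3f} of draws")
```

The unconditional coverage is close to 1/2, yet once the data are seen the report can be made exact, which is the motivation for data-dependent assessments.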

The previous (ad hoc) example indicates that data-dependent
reports are relevant. When $X = x$ the
loss, $L(\theta,\varphi(x))$, itself could serve as a perfect measure
of the accuracy of $\varphi$ if it were available (which
it is not since $\theta$ is unknown). It is natural to estimate
$L(\theta,\varphi(x))$ by a data-dependent estimator $\delta(X)$,
a new estimator called a loss estimator. Such an estimator
can serve as a data-dependent assessment (instead
of $\bar{R}_\varphi$). This is a conditional approach in the
sense that the accuracy assessment is made on a data-dependent
quantity, the loss, instead of the risk.

To evaluate the extent to which $\delta(X)$ successfully
estimates $L(\theta,\varphi(X))$, another loss is required and
it has become standard, for simplicity, to use the
squared error

(1.1)  $$L^*(\theta,\varphi(X),\delta(X)) = (\delta(X) - L(\theta,\varphi(X)))^2.$$

Insofar as we are thinking in terms of long-run frequencies,
we adopt a frequentist approach to evaluating
the performance of $L^*$ by averaging over the
sampling distribution of $X$ given $\theta$, that is, by using
a new notion of risk

(1.2)  $$\mathcal{R}(\theta,\varphi,\delta) = E_\theta[L^*(\theta,\varphi(X),\delta(X))] = E_\theta[(\delta(X) - L(\theta,\varphi(X)))^2].$$

As $\bar{R}_\varphi$ reports on the worst possible situation (the
maximum risk), we may expect that a competitive
data-dependent report $\delta(X)$ should improve on $\bar{R}_\varphi$
under the risk (1.2), that is, for all $\theta$, $\delta(X)$ satisfies

(1.3)  $$\mathcal{R}(\theta,\varphi,\delta) \le \mathcal{R}(\theta,\varphi,\bar{R}_\varphi).$$

More generally, a reference loss estimator $\delta_0$ will be
dominated by a competitive estimator $\delta$ if, for all $\theta$,

(1.4)  $$\mathcal{R}(\theta,\varphi,\delta) \le \mathcal{R}(\theta,\varphi,\delta_0),$$

with strict inequality for some $\theta$.

Unlike the usual estimation setting where the quantity
of interest is a function of the parameter $\theta$,
loss estimation involves a function of both $\theta$ and $X$
(the data). This feature may make the statistical
analysis more difficult but it is clear that the usual
notions of minimaxity, admissibility, etc., and their
methods of proof can be directly adapted to that
situation. Also, although frequentist interpretability
was evoked above, in case we would be interested
in a Bayesian approach, it is easily seen that
this approach would consist of the usual Bayes estimator
$\varphi_B$ of $\theta$ and the posterior loss
$\delta_B(X) = E[L(\theta,\varphi_B) \mid X]$.

The problem of estimating a loss function has
been considered by Sandved [43] who developed a notion
of unbiased estimator of $L(\theta,\varphi(X))$ in various
settings. However, the underlying conditional approach
traces back to Lehmann and Scheffé [37] who
estimated the power of a statistical test. Kiefer, in
a series of papers [33-35], developed conditional and
estimated confidence theories. A subjective Bayesian
approach was compared by Berger [4-6] with the frequentist
paradigm. Johnstone [32] considered (in)admissibility
of unbiased estimators of loss for the maximum
likelihood estimator $\varphi_0(X) = X$ and for
the James-Stein estimator $\varphi^{JS}(X) = (1 - (p-2)/\|X\|^2)X$
of a $p$-variate normal mean $\theta$. For $\varphi_0(X) = X$,
the unbiased estimator of the quadratic loss
$L(\theta,\varphi_0(X)) = \|\varphi_0(X) - \theta\|^2$, that is, the loss
estimator $\delta_0$ which satisfies, for all $\theta$,

(1.5)  $$E_\theta[\delta_0] = E_\theta[L(\theta,\varphi_0(X))] = R(\theta,\varphi_0),$$

is $\delta_0 = \bar{R}_\varphi = p$. Johnstone proved that (1.3) is satisfied
with the competitive estimator $\delta(X) = p - 2(p-4)/\|X\|^2$
when $p \ge 5$, the risk difference between $\delta$
and $\delta_0$ being expressed as $-4(p-4)^2 E_\theta[1/\|X\|^4]$. For
the James-Stein estimator $\varphi^{JS}$, the unbiased estimator
of loss is itself data-dependent and equal to
$\delta_0^{JS}(X) = p - (p-2)^2/\|X\|^2$. Johnstone showed that
improvement on $\delta_0^{JS}$ can be obtained with
$\delta^{JS}(X) = p - (p-2)^2/\|X\|^2 + 2p/\|X\|^2$ when $p \ge 5$, with strict
inequality in (1.4) for all $\theta$ since the difference in risk
between $\delta^{JS}$ and $\delta_0^{JS}$ equals $-4p^2 E_\theta[1/\|X\|^4]$.
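
Johnstone's improvement for the m.l.e. can be checked by simulation. The following is a minimal sketch under stated assumptions: the function name `risk_difference_mle`, the choices $p = 10$ and $\theta = 0$, the sample size and the seed are all illustrative. At $\theta = 0$ the quantity $\|X\|^2$ is chi-square with $p$ degrees of freedom, so $E_0[1/\|X\|^4] = 1/((p-2)(p-4))$ and the exact risk difference is $-4(p-4)/(p-2)$.

```python
import random


def risk_difference_mle(p, theta, n_draws, seed=0):
    """Monte Carlo estimate of R(theta, X, delta) - R(theta, X, delta_0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)                        # ||X||^2
        loss = sum((v - t) ** 2 for v, t in zip(x, theta))   # ||X - theta||^2
        delta_0 = float(p)                                   # unbiased loss estimator
        delta = p - 2.0 * (p - 4) / norm2                    # corrected loss estimator
        total += (delta - loss) ** 2 - (delta_0 - loss) ** 2
    return total / n_draws


p = 10                                   # illustrative dimension (p >= 5 is required)
diff = risk_difference_mle(p, theta=[0.0] * p, n_draws=20000)
exact_at_zero = -4.0 * (p - 4) / (p - 2)
print(f"simulated risk difference: {diff:.3f}, exact value at theta = 0: {exact_at_zero:.3f}")
```

A moderately large $p$ is used because the per-draw difference has heavy tails when $p$ is close to 5, which makes the simulation noisy.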

In Section 2, we develop the quadratic loss estimation
problem for a $p$-normal mean. After a review
of the basic ideas, a new class of loss estimators is
constructed in Section 2.1. In Section 2.2, we turn
our focus to some interesting and surprising behavior
of Bayesian assessments; this paradoxical result
is illustrated in a general inadmissibility theorem.
Section 3 is devoted to the case where the variance
is unknown. Extensions to the spherical case are
given in Section 4. In Section 4.1, we consider the
general case of a spherically symmetric distribution
around a fixed vector $\theta \in \mathbb{R}^p$ and in Section 4.2 these
ideas are then generalized to the case where a residual
vector is available. We conclude by mentioning
a number of applied and theoretical developments
of loss estimation not covered in this overview. The
Appendix gives some necessary background material
and technical results.

2. ESTIMATING THE QUADRATIC LOSS OF A P-NORMAL MEAN WITH KNOWN VARIANCE

2.1 Dominating Unbiased Estimators of Loss 

Let $X$ be a $p$-variate normally distributed $\mathcal{N}(\theta, I_p)$
random vector with unknown mean $\theta$ and identity
covariance matrix $I_p$. To estimate $\theta$, the observable
$X$ is itself a reference estimator (it is the maximum
likelihood estimator (m.l.e.) and it is an unbiased
estimator of $\theta$) so that it is convenient to write
any estimator of $\theta$ through $X$ as $\varphi(X) = X + g(X)$,
for a certain function $g$ from $\mathbb{R}^p$ into $\mathbb{R}^p$. Under
squared error loss $\|\varphi(X) - \theta\|^2$, the (quadratic) risk
of $\varphi$ is defined by

(2.1)  $$R(\theta,\varphi) = E_\theta[\|\varphi(X) - \theta\|^2],$$

where $E_\theta$ denotes the expectation with respect to
$\mathcal{N}(\theta, I_p)$.

Clearly, the risk of the m.l.e. $X$ equals $p$ and in
general $\varphi(X)$ will be a reasonable estimator only
if its risk is finite. It is easy to see (Lemma A.1
in Appendix A.1) through Schwarz's inequality that
this is the case as soon as

(2.2)  $$E_\theta[\|g(X)\|^2] < \infty,$$

which we will assume in the following (it can also
be seen that this condition is in fact necessary to
guarantee the risk finiteness).

To improve on the m.l.e. $X$ when $p \ge 3$ [i.e.,
to have $R(\theta,\varphi) \le p$], Stein [48] exhibited (under certain
differentiability conditions that we recall below)
an unbiased estimator of the risk of $\varphi(X)$, that
is, a function $\delta_0(X)$ (depending only on $X$ and not
on $\theta$) for which

(2.3)  $$R(\theta,\varphi) = E_\theta[\delta_0(X)].$$

This suggests a natural estimator of the loss $\|\varphi(X) - \theta\|^2$
since (2.3) implies that

(2.4)  $$E_\theta[\|\varphi(X) - \theta\|^2] = E_\theta[\delta_0(X)]$$

and hence $\delta_0$ is an unbiased estimator of the loss.
Stein [48] proved more precisely that
$\delta_0(X) = p + 2\,\mathrm{div}\,g(X) + \|g(X)\|^2$ [where $\mathrm{div}\,g(X)$ stands for the
divergence of $g(X)$, i.e., $\mathrm{div}\,g(X) = \sum_{i=1}^p \partial_i g_i(X)$].
One can see that $\delta_0$ may change sign so that, as an
estimator of loss (which is nonnegative), it cannot
be completely satisfactory, and hence, is likely to be
improved upon.
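
To see the formula $\delta_0(x) = p + 2\,\mathrm{div}\,g(x) + \|g(x)\|^2$ at work, the sketch below evaluates it with a central-difference divergence and compares it with the closed form $p - (p-2)^2/\|x\|^2$ for the James-Stein shift $g(x) = -(p-2)x/\|x\|^2$. The helper names (`divergence`, `unbiased_loss_estimate`, `james_stein_shift`), the step size and the evaluation points are assumptions chosen for illustration.

```python
def divergence(g, x, h=1e-5):
    """Central-difference approximation of div g(x) = sum_i d g_i / d x_i."""
    total = 0.0
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        total += (g(up)[i] - g(down)[i]) / (2.0 * h)
    return total


def unbiased_loss_estimate(g, x):
    """delta_0(x) = p + 2 div g(x) + ||g(x)||^2."""
    gx = g(x)
    return len(x) + 2.0 * divergence(g, x) + sum(v * v for v in gx)


def james_stein_shift(x):
    """g(x) = -(p - 2) x / ||x||^2, so that x + g(x) is the James-Stein estimate."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    return [-(p - 2) * v / norm2 for v in x]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
p = len(x)
norm2 = sum(v * v for v in x)
numeric = unbiased_loss_estimate(james_stein_shift, x)
closed_form = p - (p - 2) ** 2 / norm2
near_origin = unbiased_loss_estimate(james_stein_shift, [0.1] * 5)
print(numeric, closed_form, near_origin)
```

The last value is negative: close to the origin the unbiased estimator reports a negative loss, which illustrates the sign change mentioned above.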

Any competitive loss estimator $\delta(X)$ can be written
as $\delta(X) = \delta_0(X) - \gamma(X)$ for a certain function
$\gamma(X)$ which can be interpreted as a correction
to $\delta_0(X)$. Note that, for the m.l.e. [i.e., if $g(X) = 0$],
we may expect that an improvement on $\delta_0(X) = p$
would be obtained with a nonnegative function $\gamma(X)$
satisfying the requirement expressed by condition (1.3).
Note also that, similarly to the finiteness
risk condition (2.2), we will require that

(2.5)  $$E_\theta[\gamma^2(X)] < \infty$$

to ensure that the risk of $\delta(X)$ is finite (see Appendix A.1).

Using straightforward algebra, the risk difference
$\mathcal{D}(\theta,\varphi,\delta) = \mathcal{R}(\theta,\varphi,\delta) - \mathcal{R}(\theta,\varphi,\delta_0)$ simplifies to

(2.6)  $$\mathcal{D}(\theta,\varphi,\delta) = E_\theta[\gamma^2(X) - 2\gamma(X)\delta_0(X)] + 2E_\theta[\gamma(X)\|\varphi(X) - \theta\|^2].$$

Conditions for which $\mathcal{D}(\theta,\varphi,\delta) \le 0$ will be formulated
after finding an unbiased estimate of the term
$\gamma(X)\|\varphi(X) - \theta\|^2$ in the last expectation. We briefly
review the flow of ideas of those techniques.

For a function $g$ from $\mathbb{R}^p$ into $\mathbb{R}^p$, Stein's identity
(see Stein [48]) states that

(2.7)  $$E_\theta[(X - \theta)^t g(X)] = E_\theta[\mathrm{div}\,g(X)]$$

provided that these expectations exist. Here Stein
specified that $g$ was almost differentiable. Weak differentiability
is needed to handle shrinkage functions
$g(X)$, arising in the James-Stein estimators,
of the form $g(X) = -aX/\|X\|^2$, which are not
differentiable in the usual sense [such a $g(X)$ explodes
at zero]. Stein's notion is equivalent (and it
is of more common use in analysis) to the statement
that $g$ belongs to the Sobolev space $W^{1,1}_{loc}(\mathbb{R}^p)$
of weakly differentiable functions. That equivalence
was noticed by Johnstone [32].

Recall that a locally integrable function $\gamma$ from $\mathbb{R}^p$
into $\mathbb{R}$ is said to be weakly differentiable if there
exist $p$ functions $h_1,\dots,h_p$ locally integrable on $\mathbb{R}^p$
such that, for any $i = 1,\dots,p$,

(2.8)  $$\int_{\mathbb{R}^p} \gamma(x)\,\frac{\partial\varphi}{\partial x_i}(x)\,dx = -\int_{\mathbb{R}^p} h_i(x)\,\varphi(x)\,dx$$

for any infinitely differentiable function $\varphi$ on $\mathbb{R}^p$
with compact support. The function $h_i$ is the $i$th
weak partial derivative of $\gamma$. The common notation
for it is $\partial\gamma/\partial x_i$ and the vector
$\nabla\gamma = (\partial\gamma/\partial x_1,\dots,\partial\gamma/\partial x_p)^t$
is referred to as the weak gradient of $\gamma$.

Note that (2.8) usually holds when $\gamma$ is continuously
differentiable, that is, when $h_i = \partial\gamma/\partial x_i$, the
standard partial derivative, is continuous. Thus,
via (2.8), the extension to weak differentiability consists
in a property of integration by parts with vanishing
bracketed term. Naturally a function $g = (g_1,\dots,g_p)$
from $\mathbb{R}^p$ into $\mathbb{R}^p$ is said to be weakly differentiable
if each of its components $g_i$ is weakly differentiable.
In that case, the function $\mathrm{div}\,g = \sum_{i=1}^p \partial g_i/\partial x_i$
is referred to as the weak divergence of $g$; this is
the operator appearing in Stein's identity (2.7).

When dealing with an unbiased estimator of
a quantity of the form $\|X - \theta\|^2\gamma(X)$, where $\gamma$ is
a function from $\mathbb{R}^p$ into $\mathbb{R}$, writing

(2.9)  $$\|X - \theta\|^2\gamma(X) = (X - \theta)^t\{(X - \theta)\gamma(X)\}$$

naturally leads to an iteration of Stein's identity (2.7)
and involves twice weak differentiability of $\gamma$.
This is of course defined through the weak differentiability
of all the weak partial derivatives $\partial\gamma/\partial x_i$;
these second weak partial derivatives are denoted
by $\partial^2\gamma/\partial x_j\partial x_i$. Thus $\gamma$ belongs to the Sobolev
space $W^{2,1}_{loc}(\mathbb{R}^p)$ and $\Delta\gamma = \sum_{i=1}^p \partial^2\gamma/\partial x_i^2$ is referred to
as the weak Laplacian of $\gamma$.
By (2.9) and (2.7), we have

(2.10)  $$E_\theta[\|X - \theta\|^2\gamma(X)] = E_\theta[\mathrm{div}((X - \theta)\gamma(X))] = E_\theta[p\gamma(X) + (X - \theta)^t\nabla\gamma(X)]$$

by the product rule for the divergence operator. Then,
applying again (2.7) to the last term in (2.10) gives

(2.11)  $$E_\theta[(X - \theta)^t\nabla\gamma(X)] = E_\theta[\mathrm{div}(\nabla\gamma(X))] = E_\theta[\Delta\gamma(X)]$$

by definition of the Laplacian operator. Finally, gathering
(2.10) and (2.11), we have that

(2.12)  $$E_\theta[\|X - \theta\|^2\gamma(X)] = E_\theta[p\gamma(X) + \Delta\gamma(X)].$$
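
Identity (2.12) lends itself to a quick numerical check. The sketch below is an illustration under assumptions that are not in the text: the test function $\gamma(x) = 1/(1 + \|x\|^2)$, whose Laplacian is $-2p/(1+s)^2 + 8s/(1+s)^3$ with $s = \|x\|^2$, the mean $\theta = (1,0,0)$, the sample size and the seed. Both sides of (2.12) are estimated from the same normal draws.

```python
import random


def gamma_fn(x):
    """Smooth test function gamma(x) = 1 / (1 + ||x||^2) (illustrative choice)."""
    return 1.0 / (1.0 + sum(v * v for v in x))


def laplacian_gamma(x):
    """Closed-form Laplacian of gamma_fn: -2p/(1+s)^2 + 8s/(1+s)^3, s = ||x||^2."""
    p = len(x)
    s = sum(v * v for v in x)
    return -2.0 * p / (1.0 + s) ** 2 + 8.0 * s / (1.0 + s) ** 3


def both_sides(theta, n_draws, seed=0):
    """Monte Carlo estimates of the two sides of identity (2.12)."""
    rng = random.Random(seed)
    p = len(theta)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        dist2 = sum((v - t) ** 2 for v, t in zip(x, theta))   # ||X - theta||^2
        lhs += dist2 * gamma_fn(x)
        rhs += p * gamma_fn(x) + laplacian_gamma(x)
    return lhs / n_draws, rhs / n_draws


lhs, rhs = both_sides(theta=[1.0, 0.0, 0.0], n_draws=40000)
print(f"E[||X - theta||^2 gamma(X)] ~ {lhs:.4f}")
print(f"E[p gamma(X) + Laplacian gamma(X)] ~ {rhs:.4f}")
```

The two estimates agree up to simulation error, although the right-hand side never uses $\theta$ except through the sampling of $X$.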

We are now in a position to provide an unbiased
estimator of the difference in risk $\mathcal{D}(\theta,\varphi,\delta)$ in (2.6).
Its nonpositivity will be a sufficient condition for
$\mathcal{D}(\theta,\varphi,\delta) \le 0$ and hence for $\delta$ to improve on $\delta_0$.
Indeed we have

$$\|\varphi(x) - \theta\|^2 = \|x + g(x) - \theta\|^2 = \|g(x)\|^2 + 2(x - \theta)^t g(x) + \|x - \theta\|^2$$

so that, according to (2.7) and (2.12),

$$E_\theta[\|\varphi(X) - \theta\|^2\gamma(X)] = E_\theta[\gamma(X)\|g(X)\|^2 + 2\,\mathrm{div}(\gamma(X)g(X)) + p\gamma(X) + \Delta\gamma(X)].$$

Therefore, as $\mathrm{div}(\gamma(X)g(X)) = \gamma(X)\,\mathrm{div}\,g(X) + \nabla\gamma(X)^t g(X)$
and as $\delta_0(X) = p + 2\,\mathrm{div}\,g(X) + \|g(X)\|^2$,
the risk difference $\mathcal{D}(\theta,\varphi,\delta)$ in (2.6) reduces to

$$\mathcal{D}(\theta,\varphi,\delta) = E_\theta[\gamma^2(X) + 4\nabla\gamma(X)^t g(X) + 2\Delta\gamma(X)],$$

so that a sufficient condition for $\mathcal{D}(\theta,\varphi,\delta)$ to be nonpositive is

(2.13)  $$\gamma^2(x) + 4\nabla\gamma(x)^t g(x) + 2\Delta\gamma(x) \le 0$$

for any $x \in \mathbb{R}^p$.
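
As an illustration of condition (2.13), take the James-Stein shift $g(x) = -(p-2)x/\|x\|^2$ and the correction $\gamma(x) = -a/\|x\|^2$. A direct calculation gives $\gamma^2 + 4\nabla\gamma^t g + 2\Delta\gamma = a(a - 4p)/\|x\|^4$, which is negative for $0 < a < 4p$. The sketch below checks this numerically with finite differences; the helper names, step sizes, the value $a = 2p$ and the evaluation point are assumptions made for the example.

```python
def norm2(x):
    return sum(v * v for v in x)


def gradient(f, x, h=1e-4):
    """Central-difference gradient of a scalar function f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def laplacian(f, x, h=1e-3):
    """Second-order finite-difference Laplacian of a scalar function f at x."""
    total = 0.0
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        total += (f(up) - 2.0 * f(x) + f(down)) / (h * h)
    return total


def condition_lhs(gamma, g, x):
    """gamma(x)^2 + 4 grad gamma(x)^t g(x) + 2 Laplacian gamma(x), as in (2.13)."""
    inner = sum(u * v for u, v in zip(gradient(gamma, x), g(x)))
    return gamma(x) ** 2 + 4.0 * inner + 2.0 * laplacian(gamma, x)


p = 5
a = 2.0 * p  # illustrative value inside the range 0 < a < 4p


def g_js(x):
    return [-(p - 2) * v / norm2(x) for v in x]


def gamma_js(x):
    return -a / norm2(x)


x = [1.0, -2.0, 0.5, 1.5, 2.0]
numeric = condition_lhs(gamma_js, g_js, x)
closed_form = a * (a - 4.0 * p) / norm2(x) ** 2
print(numeric, closed_form)
```

With $a = 2p$ the closed form equals $-4p^2/\|x\|^4$, the pointwise version of the risk difference quoted in the Introduction for the James-Stein estimator.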

The question now arises of determining a "best"
correction $\gamma$ satisfying (2.13). The following theorem
provides a way to associate to the function $g$ a suitable
correction $\gamma$ which satisfies (2.13) in the case
where $g(x)$ is of the form $g(x) = \nabla m(x)/m(x)$ for
a certain nonnegative function $m$. This is the case
when $\varphi$ is a Bayes estimator of $\theta$ related to a prior $\pi$,
the function $m$ being the corresponding marginal
(see Brown [10]). Bock [8] showed that, through the
choice of $m$, such estimators constitute a wide class
of estimators of $\theta$ (which are called pseudo-Bayes estimators
when the function $m$ does not correspond
to a true prior $\pi$).

Theorem 2.1. Let $m$ be a nonnegative function
which is also superharmonic (respectively subharmonic)
on $\mathbb{R}^p$ such that $\nabla m/m \in W^{1,1}_{loc}(\mathbb{R}^p)$. Let $\xi$ be
a real-valued function, strictly positive and strictly
subharmonic (respectively superharmonic) on $\mathbb{R}^p$
such that

(2.14)  $$E_\theta\Bigl[\Bigl(\frac{\Delta\xi(X)}{\xi(X)}\Bigr)^2\Bigr] < \infty.$$

Assume also that there exists a constant $K > 0$ such
that, for any $x \in \mathbb{R}^p$,

(2.15)  $$m(x) > K\,\frac{\xi^2(x)}{|\Delta\xi(x)|}$$

and let

$$K_0 = \inf_{x \in \mathbb{R}^p} \frac{m(x)\,|\Delta\xi(x)|}{\xi^2(x)}.$$

Then the unbiased loss estimator $\delta_0$ of the estimator
$\varphi$ of $\theta$ defined by $\varphi(X) = X + \nabla m(X)/m(X)$
is dominated by the estimator $\delta = \delta_0 - \gamma$, where the
correction term $\gamma$ is given, for any $x \in \mathbb{R}^p$ such that
$m(x) \neq 0$, by

(2.16)  $$\gamma(x) = -a\,\mathrm{sgn}(\Delta\xi(x))\,\frac{\xi(x)}{m(x)}$$

as soon as $0 < a < 2K_0$.



Proof. The domination condition will be shown
by proving that the risk difference is less than zero.
We only consider the case where $m$ is superharmonic
and $\xi$ is strictly subharmonic, the case where $m$ is
subharmonic and $\xi$ is strictly superharmonic being
similar.

First note that the finiteness risk condition (2.5)
is guaranteed by the condition in (2.14) and the fact
that (2.15) implies that, for any $x \in \mathbb{R}^p$,

$$\gamma^2(x) = a^2\,\frac{\xi^2(x)}{m^2(x)} \le \frac{a^2}{K_0^2}\Bigl(\frac{\Delta\xi(x)}{\xi(x)}\Bigr)^2.$$

Further note that, for a shrinkage function $g$ of
the form $g(x) = \nabla m(x)/m(x)$, the left-hand side
of (2.13) can be expressed as

(2.17)  $$\mathcal{R}_\gamma(x) = \gamma^2(x) + 2\,\frac{\Delta(m(x)\gamma(x))}{m(x)} - 2\gamma(x)\,\frac{\Delta m(x)}{m(x)}$$

and hence, for $\gamma$ in (2.16), as

(2.18)  $$\mathcal{R}_\gamma(x) = a\Bigl\{a\,\frac{\xi^2(x)}{m^2(x)} - 2\,\frac{\Delta\xi(x)}{m(x)} + 2\,\frac{\xi(x)\Delta m(x)}{m^2(x)}\Bigr\}.$$

Now, since $m$ is superharmonic and $\xi$ is positive, it
follows from (2.18) that

$$\mathcal{R}_\gamma(x) \le \frac{a}{m(x)}\Bigl\{a\,\frac{\xi^2(x)}{m(x)} - 2\Delta\xi(x)\Bigr\}$$

and hence, by subharmonicity of $\xi$, the inequality
in (2.15) and the definition of $K_0$, that

(2.19)  $$\mathcal{R}_\gamma(x) \le a\,\frac{\xi^2(x)}{m^2(x)}\,\{a - 2K_0\}.$$

Finally, since $0 < a < 2K_0$, the inequality in (2.19)
gives $\mathcal{R}_\gamma(x) < 0$, which is the desired result. □

As an example, consider $m(x) = 1/\|x\|^{p-2}$, that is,
the fundamental harmonic function which is superharmonic
on the entire space $\mathbb{R}^p$ (see Du Plessis [17]).
Then we have $\nabla m(x)/m(x) = -(p-2)x/\|x\|^2$
and $\varphi(X)$ is the James-Stein estimator whose unbiased
estimator of loss is $\delta_0(X) = p - (p-2)^2/\|X\|^2$.
First note that $\nabla m/m \in W^{1,1}_{loc}(\mathbb{R}^p)$ for $p \ge 3$. Now
choosing, for any $x \neq 0$, the function $\xi(x) = 1/\|x\|^p$
gives rise to $\Delta\xi(x) = 2p/\|x\|^{p+2} > 0$ and hence to

$$\frac{\xi^2(x)}{|\Delta\xi(x)|} = \frac{1}{2p}\,\frac{1}{\|x\|^{p-2}},$$

which means that condition (2.15) is satisfied with
$K < 2p$. Also we have

$$\Bigl(\frac{\Delta\xi(x)}{\xi(x)}\Bigr)^2 = \frac{4p^2}{\|x\|^4},$$

which implies that the condition in (2.14) is satisfied
for $p \ge 5$. Now it is clear that the constant $K_0$ is
equal to $2p$ and that the correction term $\gamma$ in (2.16)
equals, for any $x \neq 0$, $\gamma(x) = -a/\|x\|^2$. Finally,
Theorem 2.1 guarantees that an improved loss estimator
over the unbiased estimator of loss $\delta_0(X)$ is
$\delta(X) = \delta_0(X) + a/\|X\|^2$ for $0 < a < 4p$, which is Johnstone's
result [32] for the James-Stein estimator.

Similarly Johnstone's result for $\varphi(X) = X$ can be
constructed with $m(x) = 1$ (which is both subharmonic
and superharmonic) and with the choice of
the superharmonic function $\xi(x) = 1/\|x\|^2$, for which
$K_0 = 2(p-4)$, so that $\delta(x) = p - a/\|x\|^2$ dominates $p$
for $0 < a < 4(p-4)$.

We have shown that the unbiased estimator of 
loss can be dominated. Often one may wish to add 
a frequentist-validity constraint to a loss estimation 
problem. Specifically in our problem, the frequentist-validity
constraint for some estimator $\delta$ would be
$E_\theta[\delta(X)] \ge E_\theta[\delta_0(X)]$ for all $\theta$. Kiefer [35] suggested
ments should be conservatively biased, that is, the 
average reported loss should be greater than or equal 
to the average actual loss. Under such a frequentist- 
validity condition Lu and Berger [40] gave improved 
loss estimators for several of the most important 
Stein-type estimators. One of their estimators is a ge- 
neralized Bayes estimator, suggesting that Bayesians 
and frequentists can potentially agree on a condi- 
tional assessment of loss. 

A possible problem with the improved estimator
defined in (2.16) is that it may be negative, which
is undesirable since we are estimating a nonnegative
quantity. A simple remedy to this problem is
to use a positive-part estimator. If we define the
positive part as $\delta^+ = \max\{\delta, 0\}$, the loss difference
between $\delta$ and $\delta^+$ is
$(\delta - L(\theta,\varphi))^2 - (\delta^+ - L(\theta,\varphi))^2 = (\delta^2 - 2\delta L(\theta,\varphi))\,1_{[\delta \le 0]}$,
hence it is always nonnegative.
Therefore the risk difference is nonnegative (and positive
whenever $\delta < 0$ with positive probability), which implies
that $\delta^+$ dominates $\delta$. It would be of interest
to find an estimator that dominates $\delta^+$.
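
The positive-part correction is easy to illustrate. The sketch below uses the improved James-Stein loss estimator $\delta(x) = p - (p-2)^2/\|x\|^2 + a/\|x\|^2$ at a point near the origin; the function names, the point $x$, the value $a = 20$ and the hypothetical true loss $L = 3$ are assumptions chosen only to display the algebra.

```python
def improved_js_loss_estimate(x, a):
    """delta(x) = p - (p - 2)^2 / ||x||^2 + a / ||x||^2."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    return p - (p - 2) ** 2 / norm2 + a / norm2


def positive_part(delta):
    """delta^+ = max(delta, 0)."""
    return max(delta, 0.0)


def loss_gain(delta, true_loss):
    """(delta - L)^2 - (delta^+ - L)^2; equals delta^2 - 2 delta L when delta <= 0."""
    return (delta - true_loss) ** 2 - (positive_part(delta) - true_loss) ** 2


x = [0.5] * 10                       # a point close to the origin (p = 10)
delta = improved_js_loss_estimate(x, a=20.0)
gain = loss_gain(delta, true_loss=3.0)
print(delta, positive_part(delta), gain)
```

Here $\delta(x) < 0$, so truncating at zero reduces the squared error for every nonnegative value of the true loss, and leaves it unchanged wherever $\delta \ge 0$.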

In the context of variance estimation, despite warnings
on its inappropriate behavior (Stein [46],
Brown [9]), the decision-theoretic approach to
normal variance estimation is typically based on the
standardized quadratic loss function, where overestimation
of the variance is much more severely penalized
than underestimation, thus leading to presumably
too small estimates. Similarly in loss estimation
under quadratic loss, the overestimation of
the loss is also much more severely penalized than
underestimation. A possible alternative to quadratic
loss would be a Stein-type loss. Suppose $\varphi(X)$ is an
estimator of $\theta$ under $\|\theta - \varphi(X)\|^2$ and let $\delta(X)$ be
an estimator of $\|\theta - \varphi(X)\|^2$ with $\delta(X) > 0$. Then we
can define the Stein-type loss for evaluating $\delta(X)$ as

(2.20)  $$L(\theta,\varphi(X),\delta(X)) = \frac{\|\theta - \varphi(X)\|^2}{\delta(X)} - \log\frac{\|\theta - \varphi(X)\|^2}{\delta(X)} - 1.$$

The analysis of the loss estimates under the Stein-type
loss is more challenging but can be carried
out using the integration-by-parts tools developed
in this section.

2.2 Dominating the Posterior Risk 

In the previous section, we have seen that the
unbiased estimator of loss should often be dismissed
since it can be dominated. When a (generalized)
Bayes estimator of $\theta$ is available, incorporating the
same prior information for estimating the loss of this 
Bayesian estimator is coherent, and we may expect 
that the corresponding Bayes estimator is a good 
candidate to improve on the unbiased estimator of 
loss. However, somewhat surprisingly, Fourdrinier 
and Strawderman [22] found that, in the normal 
setting considered in Section 2.1, the unbiased es- 
timator often dominates the corresponding general- 
ized Bayes estimator of loss for priors which give 
minimax estimators in the original point estimation 
problem. They also gave a general inadmissibility re- 
sult for a generalized Bayes estimator of loss. While 
much of their focus is on pseudo-Bayes estimators, 
in this section, we essentially present their results 
on generalized Bayes estimators. 



For a given generalized prior $\pi$, we denote the generalized
marginal by $m$ and the generalized Bayes
estimator of $\theta$ by

(2.21)  $$\varphi_m(X) = X + \frac{\nabla m(X)}{m(X)}.$$

Then (see Stein [48]) the unbiased estimator of risk
of $\varphi_m(X)$ is

(2.22)  $$\delta_0(X) = p + 2\,\frac{\Delta m(X)}{m(X)} - \frac{\|\nabla m(X)\|^2}{m^2(X)}$$

while the posterior risk of $\varphi_m(X)$ is

(2.23)  $$\delta_m(X) = p + \frac{\Delta m(X)}{m(X)} - \frac{\|\nabla m(X)\|^2}{m^2(X)}.$$

Domination of $\delta_0(X)$ over $\delta_m(X)$ is obtained
thanks to the fact that
$(\Delta m(X)/m(X))^2 - 2\Delta^{(2)}m(X)/m(X)$ is an unbiased estimator
of their risk difference, that is,

(2.24)  $$\mathcal{R}(\theta,\varphi_m,\delta_0) - \mathcal{R}(\theta,\varphi_m,\delta_m) = E_\theta\Bigl[\Bigl(\frac{\Delta m(X)}{m(X)}\Bigr)^2 - 2\,\frac{\Delta^{(2)}m(X)}{m(X)}\Bigr]$$

where $\Delta^{(2)}m = \Delta(\Delta m)$ is the bi-Laplacian of $m$
(see [22]). Thus the above domination will occur as
soon as

(2.25)  $$\Bigl(\frac{\Delta m(X)}{m(X)}\Bigr)^2 - 2\,\frac{\Delta^{(2)}m(X)}{m(X)} \le 0.$$

Applicability of that last condition is underlined by
the remarkable fact that if the prior $\pi$ satisfies (2.25),
that is, if

(2.26)  $$\Bigl(\frac{\Delta\pi(\theta)}{\pi(\theta)}\Bigr)^2 - 2\,\frac{\Delta^{(2)}\pi(\theta)}{\pi(\theta)} \le 0,$$

then (2.25) is satisfied for the marginal $m$.

As an example, Fourdrinier and Strawderman [22]
considered $\pi(\theta) = (\|\theta\|^2/2 + a)^{-b}$ (where $a \ge 0$ and
$b > 0$) and showed that, if $p > 2(b + 3)$, then (2.26)
holds and hence $\delta_0$ dominates $\delta_m$. Since $\pi$ is integrable
if and only if $b > p/2$ (for $a > 0$), the prior $\pi$
is improper whenever this condition for domination
of $\delta_0$ over $\delta_m$ holds. Of course, whenever $\pi$ is proper,
the Bayes estimator $\delta_m$ is admissible provided its
Bayes risk is finite.

Inadmissibility of the generalized Bayes loss estimator
is not exceptional. Thus, in [22], the following
general inadmissibility result is given; its proof
is parallel to the proof of Theorem 2.1.

Theorem 2.2. Let $m$ be a nonnegative function
such that $\nabla m/m \in W^{1,1}_{loc}(\mathbb{R}^p)$. Let $\xi$ be a real-valued
function satisfying the conditions of Theorem 2.1.
Then $\delta_m$ is inadmissible and a class of dominating
estimators is given by

$$\delta_m(X) + a\,\mathrm{sgn}(\Delta\xi(X))\,\frac{\xi(X)}{m(X)} \quad \text{for } 0 < a < 2K_0.$$



Note that, unlike Theorem 2.1, neither the superharmonicity
condition nor the subharmonicity condition
on $m$ is needed. Note also that Theorem 2.2
gives conditions of improvement on $\delta_m$ while Theorem 2.1
looks for improvements on $\delta_0$. As we saw,
often $\delta_0$ dominates $\delta_m$. So it is not surprising that
the proofs of the two theorems are parallel; more
precisely, it suffices to suppress, in the proof of Theorem 2.1,
the superharmonicity (or subharmonicity)
condition on $m$ to obtain the proof of Theorem 2.2.

In [22], it is suggested that the inadmissibility of
the generalized Bayes (or pseudo-Bayes) estimator is
due to the fact that the loss function
$(\delta(x) - \|\varphi(x) - \theta\|^2)^2$ may be inappropriate. The possible deficiency
of this loss is illustrated by the following simple result
concerning estimation of the square of a location
parameter in $\mathbb{R}$.

Suppose $X \in \mathbb{R}$ has density $f((x - \theta)^2)$ and that
$E_\theta[X^4] < \infty$. Consider estimation of $\theta^2$ under loss $(\delta - \theta^2)^2$.
The generalized Bayes estimator $\delta^\pi$ of $\theta^2$ with respect
to the uniform prior $\pi(\theta) = 1$ is given by

$$\delta^\pi(X) = \frac{\int \theta^2 f((X - \theta)^2)\,d\theta}{\int f((X - \theta)^2)\,d\theta} = X^2 + E_0[X^2].$$

Since this estimator has constant bias $2E_0[X^2]$, it is
dominated by the unbiased estimator $X^2 - E_0[X^2]$
(the risk difference is $4(E_0[X^2])^2$). Hence $\delta^\pi$ is inadmissible
for any $f(\cdot)$ such that $E_\theta[X^4] < \infty$.
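
The size of this risk difference can be checked by simulation in a special case. The sketch below assumes, purely for illustration, that the sampling density is standard normal around $\theta$, so that $E_0[X^2] = 1$ and the predicted risk difference is $4$; the function name `risk_difference`, the value $\theta = 2$, the sample size and the seed are likewise assumptions.

```python
import random


def risk_difference(theta, n_draws, seed=0):
    """Monte Carlo estimate of risk(X^2 + b) - risk(X^2 - b), with b = E_0[X^2] = 1."""
    rng = random.Random(seed)
    b = 1.0  # second moment of the standard normal (assumed sampling density)
    total = 0.0
    for _ in range(n_draws):
        x = theta + rng.gauss(0.0, 1.0)
        bayes = x * x + b        # generalized Bayes estimator under the flat prior
        unbiased = x * x - b     # unbiased estimator of theta^2
        total += (bayes - theta ** 2) ** 2 - (unbiased - theta ** 2) ** 2
    return total / n_draws


diff = risk_difference(theta=2.0, n_draws=40000)
print(f"simulated risk difference: {diff:.3f} (predicted value: 4)")
```

Because the two estimators differ by the constant $2E_0[X^2]$, the same gap appears at every $\theta$; only the simulation noise changes.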

2.3 Examples of Improved Estimators 

In this subsection, we give some examples of Theorems 2.1
and 2.2. The only example up to this
point of an improved estimator over the unbiased
estimator of loss $\delta_0(X)$ is $\delta(X) = \delta_0(X) + a/\|X\|^2$
for $0 < a < 4p$, which is Johnstone's result [32]. Although
the shrinkage factor in Theorems 2.1 and 2.2
is the same, in the examples below we will only focus
on improvements of posterior risk.

As an application of Theorem 2.2, let
$\xi_b(x) = (\|x\|^2 + a)^{-b}$ (with $a \ge 0$ and $b > 0$). It can be shown
that we have $\Delta\xi_b(x) < 0$ for $a \ge 0$ and $0 < 2(b + 1) < p$.
Also $\Delta\xi_b(x) > 0$ if $a = 0$ and $2(b + 1) > p$.
Furthermore

$$\frac{\xi_b^2(x)}{|\Delta\xi_b(x)|} = \frac{1}{2b\,|p - 2(b + 1)\|x\|^2/(\|x\|^2 + a)|}\,\frac{1}{(\|x\|^2 + a)^{b-1}}.$$



(a) Suppose that $0 < 2(b + 1) < p$ and $a \ge 0$. Then

$$\frac{\xi_b^2(x)}{|\Delta\xi_b(x)|} \le \frac{1}{2b(p - 2(b + 1))}\,\frac{1}{(\|x\|^2 + a)^{b-1}}$$

and $E_\theta[(\Delta\xi_b(X)/\xi_b(X))^2] < \infty$ since it is bounded
from above by a quantity proportional to $E_\theta[(\|X\|^2 + a)^{-2}]$,
which is finite for $a > 0$ or for $a = 0$ and $p > 4$.
Suppose that $m(x)$ is greater than or equal to
some multiple of $(\|x\|^2 + a)^{1-b}$ or equivalently

(2.27)  $$m(x) \ge \frac{k}{2b(p - 2(b + 1))}\,\frac{1}{(\|x\|^2 + a)^{b-1}}$$

for some $k > 0$. Theorem 2.2 implies that $\delta_m(X)$ is
inadmissible and is dominated by

$$\delta_m(X) - \frac{\alpha}{m(X)(\|X\|^2 + a)^b}$$

for $0 < \alpha < 4b(p - 2(b + 1))\inf_{x \in \mathbb{R}^p}(m(x)(\|x\|^2 + a)^{b-1})$.
Note that the improved estimators shrink
toward 0.

Suppose, for example, that $m(x) = 1$. Then (2.27)
is satisfied for $b \ge 1$. Here $\varphi_m(X) = X$ and $\delta_m(X) = p$.
Choosing $b = 1$, an improved class of estimators is
given by $p - \alpha/(\|X\|^2 + a)$ for $0 < \alpha < 4(p - 4)$. The case
$a = 0$ is equivalent to Johnstone's result for this
marginal.

(b) Suppose that $2(b + 1) > p > 4$ and $a = 0$. Then

$$\frac{\xi_b^2(x)}{|\Delta\xi_b(x)|} = \frac{1}{2b(2(b + 1) - p)}\,\frac{1}{\|x\|^{2(b-1)}}.$$

A development similar to the above implies that,
when $m(x)$ is greater than or equal to some multiple
of $\|x\|^{2(1-b)}$, an improved estimator is

$$\delta_m(X) + \frac{\alpha}{m(X)\|X\|^{2b}}$$

for $0 < \alpha < 4b(2(b + 1) - p)\inf_{x \in \mathbb{R}^p}(m(x)\|x\|^{2(b-1)})$.
Note that, in this case, the correction term is positive
and hence the estimators expand away from 0.
Note also that this result only works for $a = 0$ and
hence applies to pseudo-marginals which are unbounded
in a neighborhood of 0. Since all marginals
corresponding to a generalized prior $\pi$ are bounded,
this result can never apply to generalized Bayes procedures
but only to pseudo-Bayes procedures.

Suppose, for example, that $m(x) = \|x\|^{2-p}$. Here
$\varphi_m(X) = (1 - (p - 2)/\|X\|^2)X$ is the James-Stein estimator
and $\delta_m(X) = p - (p - 2)^2/\|X\|^2$. In particular, the above
applies for $b - 1 = (p - 2)/2$, that is, for $b = p/2$. An
improved estimator is given by $\delta_m(X) + \gamma/\|X\|^2$ for
$0 < \gamma < 4p$. This again agrees with Johnstone's result
for James-Stein estimators.

3. ESTIMATING THE QUADRATIC LOSS OF A P-NORMAL MEAN WITH UNKNOWN VARIANCE

In Section 2 it was assumed that the covariance
matrix was known and equal to the identity matrix
$I_p$. Typically, the covariance is unknown and
should be estimated. In the case where it is of the
form $\sigma^2 I_p$ with $\sigma^2$ unknown, Wan and Zou [51] showed
that, for the invariant loss $\|\varphi(X) - \theta\|^2/\sigma^2$,
Johnstone's result [32] can be extended when estimating
the loss of the James-Stein estimator. In
fact, the general framework considered in Section 2
can be extended to the case where $\sigma^2$ is unknown,
and we show that a condition parallel to Condition (2.13)
can be found.

Before stating the main result for the unknown 
variance case, we need an extension of Stein's iden- 
tity involving the sample variance. 

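Lemma 3.1. Let $X \sim \mathcal{N}(\theta,\sigma^2 I_p)$ and let $S$ be
a nonnegative random variable independent of $X$
such that $S \sim \sigma^2\chi^2_k$. Denoting by $E_{\theta,\sigma^2}$ the expectation
with respect to the joint distribution of $(X,S)$,
we have, provided the corresponding expectations exist,
the following two results:

(i) if $g(x,s)$ is a function from $\mathbb{R}^p \times \mathbb{R}_+$ into $\mathbb{R}^p$
such that, for any $s \in \mathbb{R}_+$, $g(\cdot,s)$ is weakly differentiable,
then

$$E_{\theta,\sigma^2}\Bigl[\frac{1}{\sigma^2}(X - \theta)^t g(X,S)\Bigr] = E_{\theta,\sigma^2}[\mathrm{div}_X\,g(X,S)],$$

where $\mathrm{div}_x\,g(x,s)$ is the divergence of $g(x,s)$ with
respect to $x$;

(ii) if $h(x,s)$ is a function from $\mathbb{R}^p \times \mathbb{R}_+$ into $\mathbb{R}$
such that, for any $s \in \mathbb{R}_+$, $h(\cdot,s)$ is weakly differentiable,
then

$$E_{\theta,\sigma^2}\Bigl[\frac{1}{\sigma^2}h(X,S)\Bigr] = E_{\theta,\sigma^2}\Bigl[2\,\frac{\partial}{\partial S}h(X,S) + (k - 2)S^{-1}h(X,S)\Bigr].$$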



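Proof. Part (i) is just Stein's lemma (cf. [48]).
Part (ii) can be seen as a particular case of Lemma 1(ii)
(established for elliptically symmetric distributions)
of Fourdrinier et al. [23], although we
will present a direct proof. The joint distribution of
$(X,S)$ can be viewed as resulting, in the setting of
the canonical form of the general linear model, from
the distribution of $(X,U) \sim \mathcal{N}((\theta,0),\sigma^2 I_{p+k})$ with
$S = \|U\|^2$. Then we can write

$$E_{\theta,\sigma^2}\Bigl[\frac{1}{\sigma^2}h(X,S)\Bigr] = E_{\theta,\sigma^2}\Bigl[\frac{1}{\sigma^2}\,U^t\,\frac{U}{\|U\|^2}\,h(X,\|U\|^2)\Bigr] = E_{\theta,\sigma^2}\Bigl[\mathrm{div}_U\Bigl(\frac{U}{\|U\|^2}\,h(X,\|U\|^2)\Bigr)\Bigr]$$

according to part (i). Hence, expanding the divergence
term, we have

$$E_{\theta,\sigma^2}\Bigl[\frac{1}{\sigma^2}h(X,S)\Bigr] = E_{\theta,\sigma^2}\Bigl[\frac{k - 2}{\|U\|^2}\,h(X,\|U\|^2) + \frac{U^t}{\|U\|^2}\,\nabla_U h(X,\|U\|^2)\Bigr] = E_{\theta,\sigma^2}\Bigl[\frac{k - 2}{S}\,h(X,S) + 2\,\frac{\partial}{\partial S}h(X,S)\Bigr]$$

since

$$\nabla_U h(X,\|U\|^2) = 2\,\frac{\partial}{\partial S}h(X,S)\Big|_{S = \|U\|^2}\,U. \qquad \Box$$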
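
Part (ii) of Lemma 3.1 can be illustrated numerically when $h$ does not depend on $x$. The sketch below is only an illustration under assumptions not in the text: the test function $h(s) = s/(1 + s)$, the values $k = 5$ and $\sigma^2 = 2$, the sample size and the seed. It uses the fact that $\sigma^2\chi^2_k$ is a Gamma law with shape $k/2$ and scale $2\sigma^2$.

```python
import random


def h(s):
    """Test function h(s) = s / (1 + s), free of x for simplicity."""
    return s / (1.0 + s)


def dh(s):
    """Derivative of h with respect to s."""
    return 1.0 / (1.0 + s) ** 2


def both_sides(k, sigma2, n_draws, seed=0):
    """Monte Carlo estimates of the two sides of Lemma 3.1(ii)."""
    rng = random.Random(seed)
    lhs = 0.0
    rhs = 0.0
    for _ in range(n_draws):
        # sigma^2 * chi^2_k is a Gamma law with shape k/2 and scale 2 * sigma^2.
        s = rng.gammavariate(k / 2.0, 2.0 * sigma2)
        lhs += h(s) / sigma2
        rhs += 2.0 * dh(s) + (k - 2) * h(s) / s
    return lhs / n_draws, rhs / n_draws


lhs, rhs = both_sides(k=5, sigma2=2.0, n_draws=50000)
print(f"E[h(S) / sigma^2] ~ {lhs:.4f}")
print(f"E[2 h'(S) + (k - 2) h(S) / S] ~ {rhs:.4f}")
```

The right-hand side does not involve $\sigma^2$ explicitly, which is what makes the identity useful for building unbiased estimators when the variance is unknown.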



The following theorem provides an extension of the
results in Section 2 to the setting of an unknown variance.
The conditions needed to ensure the finiteness
of the risks are given in Appendix A.1.

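Theorem 3.1. Let $X \sim \mathcal{N}(\theta, \sigma^2 I_p)$ where $\theta$ and $\sigma^2$ are unknown and $p \ge 5$ and let $S$ be a nonnegative random variable independent of $X$ and such that $S \sim \sigma^2\chi^2_k$. Consider an estimator of $\theta$ of the form $\varphi(X,S) = X + S g(X,S)$ with $E_{\theta,\sigma^2}[S^2\|g(X,S)\|^2] < \infty$, where $E_{\theta,\sigma^2}$ denotes the expectation with respect to the joint distribution of $(X,S)$.

Then an unbiased estimator of the invariant loss $\|\varphi(X,S)-\theta\|^2/\sigma^2$ is

$$\delta_0(X,S) = p + S\left\{(k+2)\|g(X,S)\|^2 + 2\operatorname{div}_X g(X,S) + 2S\frac{\partial}{\partial S}\|g(X,S)\|^2\right\}. \tag{3.1}$$

Its risk $\mathcal{R}(\theta,\sigma^2,\varphi,\delta_0) = E_{\theta,\sigma^2}[(\delta_0(X,S) - \|\varphi(X,S)-\theta\|^2/\sigma^2)^2]$ is finite as soon as $E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4] < \infty$, $E_{\theta,\sigma^2}[(S\operatorname{div}_X g(X,S))^2] < \infty$ and $E_{\theta,\sigma^2}[(S^2\frac{\partial}{\partial S}\|g(X,S)\|^2)^2] < \infty$.

Furthermore, for any function $\gamma(X)$ such that $E_{\theta,\sigma^2}[\gamma^2(X)] < \infty$, the risk difference $\mathcal{D}(\theta,\sigma^2,\varphi,\delta) = \mathcal{R}(\theta,\sigma^2,\varphi,\delta) - \mathcal{R}(\theta,\sigma^2,\varphi,\delta_0)$ between the estimators $\delta(X,S) = \delta_0(X,S) - S\gamma(X)$ and $\delta_0(X,S)$ is given by

$$E_{\theta,\sigma^2}\left[S^2\left\{\gamma^2(X) + \frac{2}{k+2}\Delta\gamma(X) + 4g^t(X,S)\nabla\gamma(X) + 4\gamma(X)\|g(X,S)\|^2\right\}\right]. \tag{3.2}$$

Therefore a sufficient condition for $\mathcal{D}(\theta,\sigma^2,\varphi,\delta)$ to be nonpositive, and hence for $\delta(X,S)$ to improve on $\delta_0(X,S)$, is

$$\gamma^2(x) + \frac{2}{k+2}\Delta\gamma(x) + 4g^t(x,s)\nabla\gamma(x) + 4\gamma(x)\|g(x,s)\|^2 \le 0 \tag{3.3}$$

for any $x \in \mathbb{R}^p$ and any $s \in \mathbb{R}_+$.

Proof. According to the expression of $\varphi(X,S)$, its risk $R(\theta,\varphi)$ is the expectation of

$$\frac{1}{\sigma^2}\|X-\theta\|^2 + 2\frac{S}{\sigma^2}(X-\theta)^t g(X,S) + \frac{S^2}{\sigma^2}\|g(X,S)\|^2. \tag{3.4}$$

Clearly $E_{\theta,\sigma^2}[\sigma^{-2}\|X-\theta\|^2] = p$ and Lemma 3.1 implies that

$$E_{\theta,\sigma^2}\left[\frac{S}{\sigma^2}(X-\theta)^t g(X,S)\right] = E_{\theta,\sigma^2}[S\operatorname{div}_X g(X,S)]$$

and, with $h(x,s) = s^2\|g(x,s)\|^2$, that

$$E_{\theta,\sigma^2}\left[\frac{S^2}{\sigma^2}\|g(X,S)\|^2\right] = E_{\theta,\sigma^2}\left[S\left\{(k+2)\|g(X,S)\|^2 + 2S\frac{\partial}{\partial S}\|g(X,S)\|^2\right\}\right].$$

Therefore $R(\theta,\varphi) = E_{\theta,\sigma^2}[\delta_0(X,S)]$ with $\delta_0(X,S)$ given in (3.1), which means that $\delta_0(X,S)$ is an unbiased estimator of the invariant loss $\|\varphi(X,S)-\theta\|^2/\sigma^2$. The fact that the risk $\mathcal{R}(\theta,\sigma^2,\varphi,\delta_0)$ of $\delta_0(X,S)$ is finite is shown in Lemma A.1.

Now consider the finiteness of the risk of the alternative loss estimator $\delta(X,S) = \delta_0(X,S) - S\gamma(X)$. It is easily seen that its difference in loss $d(\theta,\sigma^2,X,S)$ with $\delta_0(X,S)$ can be written as

$$d(\theta,\sigma^2,X,S) = \left(\delta_0(X,S) - S\gamma(X) - \frac{1}{\sigma^2}\|\varphi(X,S)-\theta\|^2\right)^2 - \left(\delta_0(X,S) - \frac{1}{\sigma^2}\|\varphi(X,S)-\theta\|^2\right)^2 = S^2\gamma^2(X) - 2S\gamma(X)\left(\delta_0(X,S) - \frac{1}{\sigma^2}\|\varphi(X,S)-\theta\|^2\right). \tag{3.5}$$

Hence, since $E_{\theta,\sigma^2}[\|\varphi(X,S)-\theta\|^2/\sigma^2] < \infty$ as the risk of the estimator $\varphi(X,S)$, the condition $E_{\theta,\sigma^2}[\gamma^2(X)] < \infty$ ensures that the expectation of the loss in (3.5), that is, the risk difference $\mathcal{D}(\theta,\sigma^2,\varphi,\delta)$, is finite. Then $\mathcal{R}(\theta,\sigma^2,\varphi,\delta) < \infty$ since $\mathcal{R}(\theta,\sigma^2,\varphi,\delta_0) < \infty$.

We now express the risk difference $\mathcal{D}(\theta,\sigma^2,\varphi,\delta) = E_{\theta,\sigma^2}[d(\theta,\sigma^2,X,S)]$. Using (3.1) and expanding $\|\varphi(X,S)-\theta\|^2/\sigma^2$ give that $d(\theta,\sigma^2,X,S)$ in (3.5) can be written as $d(\theta,\sigma^2,X,S) = A(X,S) + B(\theta,\sigma^2,X,S)$ where

$$A(X,S) = S^2\gamma^2(X) - 2pS\gamma(X) - 2(k+2)S^2\gamma(X)\|g(X,S)\|^2 - 4S^2\gamma(X)\operatorname{div}_X g(X,S) - 4S^3\gamma(X)\frac{\partial}{\partial S}\|g(X,S)\|^2 \tag{3.6}$$

and

$$B(\theta,\sigma^2,X,S) = 2\frac{S^3}{\sigma^2}\gamma(X)\|g(X,S)\|^2 + 2\frac{S}{\sigma^2}\gamma(X)\|X-\theta\|^2 + 4\frac{S^2}{\sigma^2}\gamma(X)(X-\theta)^t g(X,S). \tag{3.7}$$

Through Lemma 3.1(ii) with $h(x,s) = 2s^3\gamma(x)\|g(x,s)\|^2$, the expectation of the first term in the right-hand side of (3.7) equals

$$E_{\theta,\sigma^2}\left[2\frac{S^3}{\sigma^2}\gamma(X)\|g(X,S)\|^2\right] = E_{\theta,\sigma^2}\left[2(k+4)S^2\gamma(X)\|g(X,S)\|^2 + 4S^3\gamma(X)\frac{\partial}{\partial S}\|g(X,S)\|^2\right]. \tag{3.8}$$

An iterated application of Lemma 3.1(i) to the expectation of the second term in the right-hand side of (3.7) allows us to write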
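$$E_{\theta,\sigma^2}\left[2\frac{S}{\sigma^2}\gamma(X)\|X-\theta\|^2\right] = E_{\theta,\sigma^2}\left[\frac{1}{\sigma^2}(X-\theta)^t\{2S\gamma(X)(X-\theta)\}\right] = E_{\theta,\sigma^2}[2\operatorname{div}_X\{S\gamma(X)(X-\theta)\}] = E_{\theta,\sigma^2}[2pS\gamma(X) + 2S(X-\theta)^t\nabla\gamma(X)] = E_{\theta,\sigma^2}[2pS\gamma(X) + 2\sigma^2 S\Delta\gamma(X)]$$

which, as $S \sim \sigma^2\chi^2_k$ entails that $E[S^2/(k+2)] = E[\sigma^2 S]$ and as $S$ is independent of $X$, gives

$$E_{\theta,\sigma^2}\left[2\frac{S}{\sigma^2}\gamma(X)\|X-\theta\|^2\right] = E_{\theta,\sigma^2}\left[2pS\gamma(X) + 2\frac{S^2}{k+2}\Delta\gamma(X)\right]. \tag{3.9}$$

As for the third term in the right-hand side of (3.7), its expectation can also be expressed using Lemma 3.1(i) as

$$E_{\theta,\sigma^2}\left[4\frac{S^2}{\sigma^2}\gamma(X)(X-\theta)^t g(X,S)\right] = E_{\theta,\sigma^2}[4S^2\operatorname{div}_X\{\gamma(X)g(X,S)\}] = E_{\theta,\sigma^2}[4S^2\gamma(X)\operatorname{div}_X\{g(X,S)\} + 4S^2 g(X,S)^t\nabla\gamma(X)] \tag{3.10}$$

by the product rule for the divergence. Finally, gathering (3.8), (3.9) and (3.10) yields an expression of (3.7) which, with (3.6), gives the integrand term of (3.2), which is the desired result. $\square$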

As an example, consider the James-Stein estimator with unknown variance

$$\varphi^{JS}(X,S) = X - \frac{p-2}{k+2}\frac{S}{\|X\|^2}X.$$

Here the shrinkage factor is the product of a function of $S$ with a function of $X$ so that, through routine calculation, the unbiased estimator of loss is

$$\delta_0(X,S) = p - \frac{(p-2)^2}{k+2}\frac{S}{\|X\|^2}.$$

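For a correction of the form $\gamma(x) = -d/\|x\|^2$ with $d \ge 0$, it is easy to check that the expression in (3.3) equals

$$\left[d^2 + \frac{4(p-4)}{k+2}d - \frac{8(p-2)}{k+2}d - 4\left(\frac{p-2}{k+2}\right)^2 d\right]\frac{1}{\|x\|^4} = d\left(d - \frac{4}{k+2}\left\{p + \frac{(p-2)^2}{k+2}\right\}\right)\frac{1}{\|x\|^4},$$

which is negative for $0 < d < \frac{4}{k+2}\left\{p + \frac{(p-2)^2}{k+2}\right\}$ and gives domination of $p - \frac{(p-2)^2}{k+2}\frac{S}{\|X\|^2} + d\frac{S}{\|X\|^2}$ over $p - \frac{(p-2)^2}{k+2}\frac{S}{\|X\|^2}$. This condition recovers the result of Wan and Zou [51] who considered the case $d = \frac{2}{k+2}\left\{p + \frac{(p-2)^2}{k+2}\right\}$.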
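The James-Stein example can be turned into a few lines of code. The following is a minimal sketch, assuming NumPy; the function names (`js_unknown_variance`, `unbiased_loss_estimate`, `d_upper_bound`, `corrected_loss_estimate`) are hypothetical and simply transcribe the formulas of this example, with $\gamma(x) = -d/\|x\|^2$.

```python
import numpy as np

def js_unknown_variance(x, s, k):
    """James-Stein estimate X - (p-2)/(k+2) * S/||X||^2 * X."""
    p = x.size
    return x - (p - 2) / (k + 2) * s / np.dot(x, x) * x

def unbiased_loss_estimate(x, s, k):
    """delta_0(X, S) = p - (p-2)^2/(k+2) * S/||X||^2."""
    p = x.size
    return p - (p - 2) ** 2 / (k + 2) * s / np.dot(x, x)

def d_upper_bound(p, k):
    """Upper end of the interval of d for which expression (3.3) is negative."""
    return 4.0 / (k + 2) * (p + (p - 2) ** 2 / (k + 2))

def corrected_loss_estimate(x, s, k, d):
    """delta = delta_0 - S * gamma(X) with gamma(x) = -d / ||x||^2."""
    return unbiased_loss_estimate(x, s, k) + d * s / np.dot(x, x)

if __name__ == "__main__":
    x, s, k = np.ones(6), 12.0, 10
    d = 0.5 * d_upper_bound(x.size, k)      # midpoint of the admissible interval
    print(js_unknown_variance(x, s, k))
    print(unbiased_loss_estimate(x, s, k), corrected_loss_estimate(x, s, k, d))
```

Any $d$ strictly between 0 and `d_upper_bound(p, k)` satisfies the sufficient condition; the midpoint used in the usage lines is the value that makes the quadratic in $d$ most negative.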

4. EXTENSIONS TO THE SPHERICAL CASE 

4.1 Estimating the Quadratic Loss of the Mean 
of a Spherical Distribution 

In the previous sections the loss estimation problem was considered for the normal distribution setting. The normal distribution has been generalized in two important directions, first as a special case of the exponential family and second as a spherically symmetric distribution. In this section we will consider the latter. There are a variety of equivalent definitions and characterizations of the class of spherically symmetric distributions; a comprehensive review is given in [20]. We will use the representation of a random variable from a spherically symmetric distribution, $X = (X_1,\ldots,X_p)^t$, as $X = RU^{(p)} + \theta$, where $R = \|X-\theta\|$ is a random radius, $U^{(p)}$ is a uniform random variable on the $p$-dimensional unit sphere, where $R$ and $U^{(p)}$ are independent. In such a situation, the distribution of $X$ is said to be spherically symmetric around $\theta$ and we write $X \sim SS_p(\theta)$. We also extend, in Section 4.2, these results to the case where the distribution of $X$ is spherically symmetric and when a residual vector $U$ is available (which allows an estimation of the variance factor $\sigma^2$).
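The stochastic representation $X = RU^{(p)} + \theta$ gives a direct way to simulate from a spherically symmetric distribution. The sketch below assumes NumPy; the function `sample_spherical` and its `radius_sampler` argument are hypothetical names introduced for illustration. The uniform direction is obtained by normalizing a standard normal vector, and the radius law is left to the caller.

```python
import numpy as np

def sample_spherical(theta, radius_sampler, n, rng):
    """Draw n points X = R * U + theta with U uniform on the unit sphere."""
    p = theta.size
    Z = rng.standard_normal((n, p))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # uniform direction
    R = radius_sampler(n, rng)                          # radii, independent of U
    return theta + R[:, None] * U, R, U

def chi_radius(p):
    """Radius law of the standard normal case: R = sqrt(chi^2_p)."""
    return lambda n, rng: np.sqrt(rng.chisquare(p, size=n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = np.array([1.0, -2.0, 0.5, 3.0])
    X, R, U = sample_spherical(theta, chi_radius(theta.size), 1000, rng)
    print(X.mean(axis=0))
```

Swapping `chi_radius` for another nonnegative radius sampler changes the distribution within the spherical class while keeping the center $\theta$.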

Assume $X \sim SS_p(\theta)$ and suppose we wish to estimate $\theta \in \mathbb{R}^p$ by a decision rule $\varphi(X)$ using quadratic loss. Suppose that we also use quadratic loss to assess the accuracy of a loss estimate $\delta(X)$; then the risk of this loss estimate is given by (1.2). In [26], the problem of estimating the loss when $\varphi(X) = X$ is the estimate of the location parameter $\theta$ is considered. The estimate $\varphi$ is the least squares estimator and is minimax among the class of spherically symmetric distributions with bounded second moment. Furthermore, if one assumes the density of $X$ exists and is unimodal, then $\varphi$ is also the maximum likelihood estimator.

The unbiased constant estimate of the loss $\|X-\theta\|^2$ is $\delta_0 = E_\theta[R^2]$. Note that $\delta_0$ is independent of $\theta$, since $E_\theta[\|X-\theta\|^2] = E_0[\|X\|^2]$. Fourdrinier and Wells [26] showed that the unbiased estimator $\delta_0$ can be dominated by $\delta_0 - \gamma$, where $\gamma$ is a particular superharmonic function for the case where the sampling distribution is a scale mixture of normals and in more general spherical cases.

The development of the results depends on some interesting extensions of the classical Stein identities in (2.7) and (2.12) to the general spherical setting. Since the distribution of $X$, say $P_\theta$, is spherically symmetric around $\theta$, for every bounded function $f$, we have $E_\theta[f] = E[E_{R,\theta}[f]] = \int_{\mathbb{R}_+} E_{R,\theta}[f]\,\rho(dR)$, where $\rho$ is the distribution of the radius, namely the distribution of the norm $\|X-\theta\|$ under $P_\theta$, and where $E$ and $E_{R,\theta}$ denote respectively the expectation with respect to the radial distribution and the uniform distribution $U_{R,\theta}$ on the sphere $S_{R,\theta} = \{x \in \mathbb{R}^p \mid \|x-\theta\| = R\}$ of radius $R$ and center $\theta$. To deduce the various risk domination results it suffices to work conditionally on the radius, that is to say, to replace $P_\theta$ by $U_{R,\theta}$ in the risk expressions. Let $\sigma_{R,\theta}$ denote the area measure on $S_{R,\theta}$. Therefore, for every Borel measurable set $A$, $U_{R,\theta}(A) = \sigma_{R,\theta}(A)/\sigma_{R,\theta}(S_{R,\theta}) = \Gamma(p/2)\,\sigma_{R,\theta}(A)/(2\pi^{p/2}R^{p-1})$. Define the volume measure $\tau_{R,\theta}$ on the ball $B_{R,\theta} = \{x \in \mathbb{R}^p \mid \|x-\theta\| \le R\}$ of radius $R$ and center $\theta$ and denote the uniform distribution on $B_{R,\theta}$ as $V_{R,\theta}$. Hence, for every Borel measurable set $A$, $V_{R,\theta}(A) = \tau_{R,\theta}(A)/\tau_{R,\theta}(B_{R,\theta}) = p\,\Gamma(p/2)\,\tau_{R,\theta}(A)/(2\pi^{p/2}R^p)$. Suppose $\gamma$ is a weakly differentiable vector-valued function; then by applying the Divergence Theorem for weakly differentiable functions to the definition of the expectation we have

$$E_\theta[(X-\theta)^t\gamma(X) \mid \|X-\theta\| = R] = \int_{S_{R,\theta}} (x-\theta)^t\gamma(x)\,U_{R,\theta}(dx) = \frac{R}{\sigma_{R,\theta}(S_{R,\theta})}\int_{B_{R,\theta}} \operatorname{div}\gamma(x)\,dx. \tag{4.1}$$

If $\gamma$ is a real-valued function, then it follows from (4.1) and the product rule applied to the vector-valued function $(x-\theta)\gamma(x)$ that

$$E_\theta[\|X-\theta\|^2\gamma(X) \mid \|X-\theta\| = R] = \int_{S_{R,\theta}} (x-\theta)^t(x-\theta)\gamma(x)\,U_{R,\theta}(dx) = \frac{R}{\sigma_{R,\theta}(S_{R,\theta})}\int_{B_{R,\theta}} [p\gamma(x) + (x-\theta)^t\nabla\gamma(x)]\,dx. \tag{4.2}$$



Our first extension of Theorem 2.1 is to the class of spherically symmetric distributions that are scale mixtures of normal distributions. Well-known examples in this class of densities include the double exponential and the multivariate $t$-distribution (hence, the multivariate Cauchy distribution). Let $\phi(x;\theta,I)$ be the probability density function of a random vector $X$ with a normal distribution with mean vector $\theta$ and identity covariance matrix. Suppose that there is a probability measure $G$ on $\mathbb{R}_+$ such that the probability density function $p_\theta$ may be expressed as

$$p_\theta(x \mid \theta) = \int_{\mathbb{R}_+} \phi(x;\theta,I/t)\,G(dt). \tag{4.3}$$

One can think of $T$ being a random variable with distribution $G$; the conditional distribution of $X$ given $T = t$, $X \mid T = t$, is $\mathcal{N}_p(\theta, I/t)$. This class contains some heavy-tailed distributions, possibly with no moments. It is well known (see [20]) that, if a spherical distribution has a density $p_\theta$, it is of the form $p_\theta(x) = g(\|x-\theta\|^2)$ for a measurable positive function $g$ (called the generating function).

In the scale mixture of normals setting the unbiased estimate, $\delta_0$, of risk equals

$$E[R^2] = E_\theta[\|X-\theta\|^2] = p\int_{\mathbb{R}_+} t^{-1}\,G(dt).$$

It is easy to see that the risk of the unbiased estimator $\delta_0$ is finite if and only if $E_\theta[\|X-\theta\|^4] < \infty$, which holds if

$$\int_{\mathbb{R}_+} t^{-2}\,G(dt) < \infty. \tag{4.4}$$
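For a concrete mixing distribution these quantities are available in closed form. The sketch below takes, as an assumption made for illustration only, the multivariate Student $t$ distribution with $\nu$ degrees of freedom, which corresponds to the mixing variable $T \sim \mathrm{Gamma}(\nu/2, \text{rate } \nu/2)$ (that is, $T = \chi^2_\nu/\nu$). The helper names are hypothetical; only the Python standard library is used.

```python
import math

def gamma_moment(shape, rate, m):
    """E[T^m] for T ~ Gamma(shape, rate); infinite when shape + m <= 0."""
    if shape + m <= 0:
        return math.inf
    return math.exp(math.lgamma(shape + m) - math.lgamma(shape)) / rate**m

def delta0_student(p, nu):
    """delta_0 = p * E[1/T] for the Student t_nu mixing law (assumed)."""
    return p * gamma_moment(nu / 2.0, nu / 2.0, -1)

def condition_4_4_student(nu):
    """Condition (4.4): E[T^(-2)] is finite, here exactly when nu > 4."""
    return math.isfinite(gamma_moment(nu / 2.0, nu / 2.0, -2))

if __name__ == "__main__":
    print(delta0_student(5, 6.0), condition_4_4_student(6.0))
```

Under this assumption $\delta_0 = p\nu/(\nu-2)$, and the fourth-moment condition (4.4) fails for $\nu \le 4$, which covers the multivariate Cauchy case $\nu = 1$.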



The main theorem in [26] is the following domination result of an improved estimator of loss over the unbiased loss estimator.

Theorem 4.1. Assume the distribution of $X$ is a scale mixture of normal random variables as in (4.3) such that (4.4) is satisfied and such that

$$\int_{\mathbb{R}_+} t^{p/2}\,G(dt) < \infty. \tag{4.5}$$

Also, assume that the shrinkage function $\gamma$ is twice weakly differentiable on $\mathbb{R}^p$ and satisfies $E_\theta[\gamma^2] < \infty$, for every $\theta \in \mathbb{R}^p$. Then a sufficient condition for $\delta_0 - \gamma$ to dominate $\delta_0$ is that $\gamma$ satisfies the differential inequality

$$k\Delta\gamma + \gamma^2 \le 0 \quad\text{with}\quad k = 2\,\frac{\int_{\mathbb{R}_+} t^{p/2-2}\,G(dt)}{\int_{\mathbb{R}_+} t^{p/2}\,G(dt)}. \tag{4.6}$$

As an example let $\gamma(x) = c/\|x\|^2$ where $c$ is a positive constant. Note that $\gamma$ is twice weakly differentiable only when $p > 4$ (thus its Laplacian exists as a locally integrable function). Then it may be shown that $\Delta\gamma(x) = -2c(p-4)/\|x\|^4$. Hence $k\Delta\gamma(x) + \gamma^2(x) = -2kc(p-4)/\|x\|^4 + c^2/\|x\|^4 \le 0$ if $-2kc(p-4) + c^2 \le 0$, that is, $0 \le c \le 2k(p-4)$. It is easy to see that the optimal value of $c$ for which this inequality is the most negative equals $k(p-4)$, so an interesting estimate in this class of $\gamma$'s is $\delta = \delta_0 - k(p-4)/\|x\|^2$ $(p > 4)$. This is precisely the estimate proposed by [32] in the normal distribution case $\mathcal{N}_p(\theta, I)$ where $k = 2$; recall, in that case $\delta_0 = p$.
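The example can be checked numerically. The sketch below, assuming NumPy, compares the closed-form Laplacian of $\gamma(x) = c/\|x\|^2$ with a central finite-difference approximation and evaluates $k\Delta\gamma + \gamma^2$. The function `k_student` is an added illustration, not a formula from the text: it evaluates the constant $k$ of (4.6) when the mixing law is that of a multivariate Student $t_\nu$, i.e. $T \sim \mathrm{Gamma}(\nu/2, \text{rate } \nu/2)$, for which the ratio of moments reduces to $2\nu^2/((\nu+p-2)(\nu+p-4))$.

```python
import numpy as np

def k_student(p, nu):
    """Constant k of (4.6) under an assumed Student t_nu mixing law."""
    return 2.0 * nu**2 / ((nu + p - 2.0) * (nu + p - 4.0))

def gamma_fn(x, c):
    """Shrinkage function gamma(x) = c / ||x||^2."""
    return c / np.dot(x, x)

def laplacian_fd(f, x, h=1e-3):
    """Central finite-difference approximation of the Laplacian of f at x."""
    total = 0.0
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        total += (f(x + e) - 2.0 * f(x) + f(x - e)) / h**2
    return total

def inequality_lhs(x, c, k):
    """k * Laplacian(gamma)(x) + gamma(x)^2, using the closed-form Laplacian."""
    p = x.size
    return (-2.0 * k * c * (p - 4) + c**2) / np.dot(x, x) ** 2

if __name__ == "__main__":
    x = np.array([1.0, 2.0, 0.5, -1.0, 1.5, 0.3])
    k = k_student(x.size, nu=10.0)
    c = k * (x.size - 4)                      # optimal constant of the example
    print(laplacian_fd(lambda y: gamma_fn(y, c), x), inequality_lhs(x, c, k))
```

As $\nu \to \infty$ the value of `k_student` tends to 2, the normal-case constant; condition (4.4) still requires $\nu > 4$ under this mixing assumption.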

In this example, we have assumed that the dimension $p$ is greater than 4. In general we can have domination as long as the assumptions of the theorem are valid. Actually, Blanchard and Fourdrinier [7] showed explicitly that, when $p \le 4$, the only solution $\gamma$ in $L^2(\mathbb{R}^p)$ of the inequality $k\Delta\gamma + \gamma^2 \le 0$ is $\gamma \equiv 0$, almost everywhere with respect to the Lebesgue measure $\lambda$. Now, in the normal setting $\mathcal{N}_p(\theta, I/t)$, an unbiased estimator of the risk difference between an estimator $\delta = \delta_0 - \gamma$ and $\delta_0$ is $2t^{-2}\Delta\gamma + \gamma^2$. Hence, for dimensions 4 or less, it is impossible to find an estimator $\delta = \delta_0 - \gamma$ whose unbiased estimate of risk is always less than that of $\delta_0$. Indeed we cannot have $E_\theta[2t^{-2}\Delta\gamma + \gamma^2] < 0$, for some $\theta$, without having $\lambda[2t^{-2}\Delta\gamma(x) + \gamma^2(x) < 0] > 0$, which entails that $\lambda[\gamma(x) \ne 0] > 0$.

In the case of scale mixture of normal distributions, the conjecture of admissibility of $\delta_0$ for lower dimensions, although it is probably true, remains open. Indeed, under the conditions of Theorem 4.1, $k\Delta\gamma + \gamma^2$ is no longer an unbiased estimator of the risk difference and $E_\theta[k\Delta\gamma + \gamma^2]$ is only its upper bound. The use of Blyth's method would require specifying the distribution of $X$ (i.e., the mixing distribution $G$). It is worth noting that a dimension cutoff also arises through the finiteness of $E_\theta[\gamma^2]$ when using the classical shrinkage function $c/\|x\|^2$.

In order to prove Theorem 4.1 we need some additional technical results. The first lemma gives some important properties of superharmonic functions and is found in Du Plessis [17] and the second lemma links the integral of the gradient on a ball with the integral of the Laplacian.

Lemma 4.1. If $\gamma$ is a real-valued superharmonic function, then:

(i) $\int_{S_{R,\theta}} \gamma(x)\,U_{R,\theta}(dx) \le \int_{B_{R,\theta}} \gamma(x)\,V_{R,\theta}(dx)$,
(ii) both of the integrals in (i) are decreasing in $R$.

Proof. See Sections 1.3 and 2.5 in [17]. $\square$

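Lemma 4.2. Suppose $\gamma$ is a twice weakly differentiable function. Then

$$\int_{B_{R,\theta}} (x-\theta)^t\nabla\gamma(x)\,V_{R,\theta}(dx) = \frac{p\,\Gamma(p/2)}{2\pi^{p/2}R^p}\int_0^R r\int_{B_{r,\theta}} \Delta\gamma(x)\,dx\,dr.$$

Proof. Since the density of the distribution of the radius under $V_{R,\theta}$ is $(p/R^p)\,r^{p-1}\,\mathbf{1}_{[0,R]}(r)$, we have

$$\int_{B_{R,\theta}} (x-\theta)^t\nabla\gamma(x)\,V_{R,\theta}(dx) = \int_0^R \int_{S_{r,\theta}} (x-\theta)^t\nabla\gamma(x)\,U_{r,\theta}(dx)\,\frac{p}{R^p}\,r^{p-1}\,dr.$$

The result follows from applying (4.1) to the innermost integral of the right-hand side of this equality and by recalling the fact that $\sigma_{r,\theta}(S_{r,\theta}) = (2\pi^{p/2}/\Gamma(p/2))\,r^{p-1}$. $\square$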

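Proof of Theorem 4.1. Denoting by $\rho$ the distribution of the radius $\|X-\theta\|$, the risk difference between $\delta_0$ and $\delta_0 - \gamma$ equals $\alpha(\theta) + \beta(\theta)$ where

$$\alpha(\theta) = \int_{\mathbb{R}_+} \alpha_R(\theta)\,\rho(dR) \quad\text{and}\quad \beta(\theta) = \int_{\mathbb{R}_+} \beta_R(\theta)\,\rho(dR) \tag{4.7}$$

with

$$\alpha_R(\theta) = 2R^2\int_{B_{R,\theta}} \gamma(x)\,V_{R,\theta}(dx) - 2\delta_0\int_{S_{R,\theta}} \gamma(x)\,U_{R,\theta}(dx) \tag{4.8}$$

and

$$\beta_R(\theta) = 2\frac{R^2}{p}\int_{B_{R,\theta}} (x-\theta)^t\nabla\gamma(x)\,V_{R,\theta}(dx) + \int_{S_{R,\theta}} \gamma^2(x)\,U_{R,\theta}(dx). \tag{4.9}$$

Indeed, the risk difference conditional on the radius $R$ equals

$$\int_{S_{R,\theta}} [2\|x-\theta\|^2\gamma(x) - 2\delta_0\gamma(x) + \gamma^2(x)]\,U_{R,\theta}(dx)$$

and the result follows from (4.2) applied to the first term between brackets.

Let us first deal with $\alpha(\theta)$ considering the first term in (4.8). We have from the definition of $V_{R,\theta}$ and an application of Fubini's theorem

$$\int_{\mathbb{R}_+} R^2\int_{B_{R,\theta}} \gamma(x)\,V_{R,\theta}(dx)\,\rho(dR) = \frac{p\,\Gamma(p/2)}{2\pi^{p/2}}\int_{\mathbb{R}_+} R^{2-p}\int_{B_{R,\theta}} \gamma(x)\,dx\,\rho(dR) = \frac{p\,\Gamma(p/2)}{2\pi^{p/2}}\int_{\mathbb{R}^p} \gamma(x)\int_{\|x-\theta\|}^\infty R^{2-p}\,\rho(dR)\,dx. \tag{4.10}$$

Now, for fixed $t \ge 0$, in the normal case $\mathcal{N}_p(\theta, I/t)$ the distribution $\rho_t$ of the radius has the density $f_t$ of the form $f_t(R) = \frac{t^{p/2}}{2^{p/2-1}\Gamma(p/2)}R^{p-1}\exp\{-tR^2/2\}$. Thus the expression (4.10) becomes

$$\int_{\mathbb{R}_+} R^2\int_{B_{R,\theta}} \gamma(x)\,V_{R,\theta}(dx)\,\rho_t(dR) = \frac{p\,t^{p/2}}{(2\pi)^{p/2}}\int_{\mathbb{R}^p} \gamma(x)\int_{\|x-\theta\|}^\infty R\exp\left\{-\frac{tR^2}{2}\right\}dR\,dx = \frac{p\,t^{p/2-1}}{(2\pi)^{p/2}}\int_{\mathbb{R}^p} \gamma(x)\exp\left\{-\frac{t}{2}\|x-\theta\|^2\right\}dx = \frac{p}{t}\int_{\mathbb{R}_+}\int_{S_{R,\theta}} \gamma(x)\,U_{R,\theta}(dx)\,\rho_t(dR),$$

the last equality holding since $X = RU^{(p)} + \theta$. Turning back to (4.7) and (4.8) and using the mixture representation with mixing distribution $G$, the expression of $\alpha(\theta)$ is written as

$$\alpha(\theta) = 2p\int_{\mathbb{R}_+} \left(\frac{1}{t} - \int_{\mathbb{R}_+} u^{-1}\,G(du)\right)\int_{\mathbb{R}^p} \gamma(x)\left(\frac{t}{2\pi}\right)^{p/2}\exp\left(-\frac{t}{2}\|x-\theta\|^2\right)dx\,G(dt). \tag{4.11}$$

It can be easily seen that the innermost integral in (4.11) is proportional to

$$\int_0^\infty \int_{S_{(u/t)^{1/2},\theta}} \gamma(x)\,dU_{(u/t)^{1/2},\theta}(x)\,u^{p/2-1}\exp\left(-\frac{u}{2}\right)du$$

and hence is nondecreasing in $t$ by superharmonicity of $\gamma$ induced by the inequality in (4.6) and by Lemma 4.1(ii). Thus, since $\delta_0 = p/t$ for fixed $t$, the expression for $\alpha(\theta)$ in (4.11) is a nonpositive covariance with respect to $G$.

We can now treat the integral of the expression $\beta(\theta)$ in the same manner. The function $x \mapsto (x-\theta)^t\nabla\gamma(x)$ and the function $x \mapsto \nabla\gamma(x)$ taking successively the role of the function $\gamma$, we obtain

$$\int_{\mathbb{R}_+} \frac{R^2}{p}\int_{B_{R,\theta}} (x-\theta)^t\nabla\gamma(x)\,V_{R,\theta}(dx)\,\rho_t(dR) = \frac{1}{t}\int_{\mathbb{R}_+}\int_{S_{R,\theta}} (x-\theta)^t\nabla\gamma(x)\,U_{R,\theta}(dx)\,\rho_t(dR) = \frac{1}{t}\int_{\mathbb{R}_+} \frac{R^2}{p}\int_{B_{R,\theta}} \Delta\gamma(x)\,V_{R,\theta}(dx)\,\rho_t(dR) = \frac{t^{p/2-2}}{(2\pi)^{p/2}}\int_{\mathbb{R}^p} \Delta\gamma(x)\exp\left\{-\frac{t}{2}\|x-\theta\|^2\right\}dx$$

by applying (4.1) for the second equality and remembering that $\Delta\gamma = \operatorname{div}(\nabla\gamma)$. Therefore by the Fubini Theorem $\beta(\theta)$ can be reexpressed as

$$\beta(\theta) = \int_{\mathbb{R}^p} \left[2\Delta\gamma(x)\,\frac{\int_{\mathbb{R}_+} t^{p/2-2}\exp(-t\|x-\theta\|^2/2)\,G(dt)}{\int_{\mathbb{R}_+} t^{p/2}\exp(-t\|x-\theta\|^2/2)\,G(dt)} + \gamma^2(x)\right]\int_{\mathbb{R}_+} \left(\frac{t}{2\pi}\right)^{p/2}\exp\left(-\frac{t}{2}\|x-\theta\|^2\right)G(dt)\,dx. \tag{4.12}$$

Now, through a monotone likelihood ratio argument, the factor multiplying $\Delta\gamma(x)$ in (4.12), namely twice the ratio of integrals, can be seen to be bounded from below by the constant $k$ in (4.6). Hence, $\gamma$ being superharmonic, the inequality in (4.6) gives

$$\beta(\theta) \le \int_{\mathbb{R}^p} (k\Delta\gamma(x) + \gamma^2(x))\int_{\mathbb{R}_+} \left(\frac{t}{2\pi}\right)^{p/2}\exp\left(-\frac{t}{2}\|x-\theta\|^2\right)G(dt)\,dx \le 0.$$

Finally, remembering that $\alpha(\theta)$ is nonpositive, it follows that the risk difference $\alpha(\theta) + \beta(\theta)$ between $\delta_0$ and $\delta_0 - \gamma$ is nonpositive, which proves the theorem. $\square$

The improved loss estimator result in Theorem 4.1 for the family of scale mixtures of normal distributions was extended to a more general family of spherically symmetric distributions in [26]. In this setting the conditions for improvement rest on the generating function $g$ of the spherical density $p_\theta$. A sufficient condition for domination of $\delta_0$ has the usual form

$$k\Delta\gamma + \gamma^2 \le 0.$$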

Theorem 4.2. Assume the spherical distribution of $X$ with generating function $g$ has finite fourth moment. Assume the function $\gamma$ is nonnegative and twice weakly differentiable on $\mathbb{R}^p$ and satisfies $E_\theta[\gamma^2] < \infty$. If, for every $s \ge 0$,



(4.13) 



f™g(z)dz Jo 



2g(s) 



P 



and if there exists a constant $k$ such that, for any $s \ge 0$,

$$0 < k \le \frac{\int_s^\infty z\,g(z)\,dz - s\int_s^\infty g(z)\,dz}{2g(s)}, \tag{4.14}$$

then a sufficient condition for $\delta_0 - \gamma$ to dominate $\delta_0$ is that $\gamma$ satisfies the differential inequality

$$k\Delta\gamma + \gamma^2 \le 0.$$

We have shown that one can dominate the unbiased constant estimator of loss by a shrinkage-type estimator. As in the normal case one may wish to add a frequentist-validity constraint to the loss estimation problem. It is easy to show that $\delta_0$ would be the only frequentist valid loss estimator. The proof of this result follows from a randomization of the origin technique as in Hsieh and Hwang [30].

4.2 Estimating the Quadratic Loss of the Mean 
of a Spherical Distribution with a Residual 
Vector 

In this section, we extend the ideas of the previ- 
ous sections to a spherically symmetric distribution 
with a residual vector. We first develop an unbiased 
estimator of the loss and then construct a dominat- 
ing shrinkage-type estimator. An important feature 
of our results is that the proposed loss estimates 
dominate the unbiased estimates for the entire class 
of spherically symmetric distributions. That is, the 
domination results are robust with respect to spher- 
ical symmetry. 

Let $(X,U) \sim SS(\theta,0)$ where $\dim X = \dim\theta = p$ and $\dim U = \dim 0 = k$ $(p + k = n)$. For convenience of notation, here $(X,U)$ and $(\theta,0)$ represent $n \times 1$ vectors (see Appendix A.2 for more details on this model). Unlike Section 4.1, the dimension of the observable $(X,U)$ is greater than the dimension of the estimand $\theta$. This model arises as the canonical form of the following seemingly more general model, the general linear model. Let $V$ be an $n \times p$ matrix (of full rank $p$) which is often referred to as the design matrix. Suppose an $n \times 1$ vector $Y$ is observed such that $Y = V\beta + \varepsilon$ where $\beta$ is a $p \times 1$ vector of (unknown) regression coefficients and $\varepsilon$ is an $n \times 1$ vector with a spherically symmetric distribution about 0. A common alternative representation of this model is $Y = \eta + \varepsilon$ where $\varepsilon$ is as above and $\eta$ is in the column space of $V$.

To understand this representation in terms of the general linear model, let $G = (G_1^t, G_2^t)^t$ be an $n \times n$ orthogonal matrix partitioned such that the first $p$ rows of $G$ (i.e., the rows of $G_1$ considered as column vectors) span the column space of $V$. Now let

$$\begin{pmatrix} X \\ U \end{pmatrix} = GY = \begin{pmatrix} G_1 \\ G_2 \end{pmatrix}V\beta + G\varepsilon = \begin{pmatrix} \theta \\ 0 \end{pmatrix} + G\varepsilon$$

with $\theta = G_1V\beta$ and $G_2V\beta = 0$ since the rows of $G_2$ are orthogonal to the columns of $V$. It follows from the definition that $(X,U)$ has a spherically symmetric distribution about $(\theta,0)$. In this sense, the model given above is the canonical form of the general linear model.
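The reduction to canonical form can be carried out numerically with a complete QR decomposition of the design matrix. The sketch below assumes NumPy; the function name `canonical_form` is hypothetical, and using QR is one convenient way (not the only one) to build an orthogonal $G$ whose first $p$ rows span the column space of $V$.

```python
import numpy as np

def canonical_form(V, Y):
    """Return (X, U, G1, G2) where G = [G1; G2] is orthogonal and the
    rows of G1 span the column space of the full-rank design matrix V."""
    n, p = V.shape
    Q, _ = np.linalg.qr(V, mode="complete")   # Q is an n x n orthogonal matrix
    G1, G2 = Q[:, :p].T, Q[:, p:].T           # p x n and (n - p) x n blocks
    return G1 @ Y, G2 @ Y, G1, G2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 8, 3
    V = rng.standard_normal((n, p))
    beta = np.array([1.0, -2.0, 0.5])
    Y = V @ beta + rng.standard_normal(n)
    X, U, G1, G2 = canonical_form(V, Y)
    print(np.abs(G2 @ V).max())               # numerically zero
    print(U @ U)                              # residual sum of squares
```

With this construction $\|U\|^2$ coincides with the least squares residual sum of squares $\|Y - V\hat\beta\|^2$, which is why $U$ is called the residual vector.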

The usual estimator of $\theta$ is the orthogonal projector $X$. A class of competing point estimators which are also considered is of the form $\varphi = X - \|U\|^2 g(X)$; $g$ is a measurable function from $\mathbb{R}^p$ into $\mathbb{R}^p$. This class of estimators is closely related to Stein-like estimators (when estimating the mean of a normal distribution, the square of the residual term $\|U\|$ is used as an estimate of the unknown variance). Their domination properties are robust with respect to spherical symmetry (cf. [11] and [12]). We will first consider estimation of the loss of the usual least squares estimator $X$, then estimation of the loss of the more general shrinkage estimator $\varphi$. In order to ensure the finiteness of the risk of the usual estimator $X$ and of the risk of the shrinkage estimator $\varphi$, we need two hypotheses (H1) and (H2) given in [11].

In the spherical case in Section 3, the risk of $X$ was constant with respect to $\theta$. Thus this risk provides an unbiased estimator of the loss, that is, $pE[R^2]/n$, which is subject to the knowledge of $E[R^2]$. Its properties, as the properties of any improved estimator, may depend on the specific underlying distribution. An important feature of the results in this subsection is that we propose an unbiased estimator $\delta_0$ of the loss of $X$ which is available for every spherically symmetric distribution (with finite fourth moment), that is, $\delta_0(X,U) = p\|U\|^2/k$. Thus we do not need to know the specific distribution, and we get robustness with an estimator which is no longer constant. Notice $\delta_0$ makes sense because $p < n$ (i.e., $k \ge 1$).

In this subsection, we consider estimation of $\theta$ by $X$ so that, as in the work of Fourdrinier and Wells [25], we deal with estimating the loss $\|X-\theta\|^2$. An unbiased estimator of that loss is given by $\delta_0(X,U) = p\|U\|^2/k$, which we write $\delta_0(U)$ since it depends only on $U$. The unbiasedness of $\delta_0$ follows from Corollary A.1 by taking $q = 0$ and $\gamma \equiv 1$. The goal of this subsection is to prove the domination of the unbiased estimator $\delta_0$ by a competing estimator $\delta$ of the form

$$\delta(X,U) = \delta_0(U) - \|U\|^4\gamma(X), \tag{4.15}$$

where $\gamma$ is a nonnegative function. It is important to notice that the "residual term" $\|U\|$ appears explicitly in the shrinkage function. It has been noted in [11] that the use of this term allows fewer assumptions about the distributions than when it does not appear. Specifically, its inclusion gives a robustness property to the results, since they are valid for the entire class of spherically symmetric distributions.

We require the real-valued function $\gamma$ to be twice weakly differentiable, in order to include basic examples, which are not twice differentiable. The following domination result is given in [25]. We will see below that it appears as a consequence of a more general result when shrinkage estimators of $\theta$ are involved.

Theorem 4.3. Assume that $p \ge 5$, the distribution of $(X,U)$ has a finite fourth moment and the function $\gamma$ is twice weakly differentiable on $\mathbb{R}^p$ and there exists a constant $\beta$ such that $\gamma(t) \le \beta/\|t\|^2$. A sufficient condition under which the estimator $\delta$ in (4.15) dominates the unbiased estimator $\delta_0$ is that $\gamma$ satisfies the differential inequality

$$\gamma^2 + \frac{2}{(k+4)(k+6)}\Delta\gamma \le 0. \tag{4.16}$$

The standard example where $\gamma(t) = d/\|t\|^2$ for all $t \ne 0$ with $d \ge 0$ satisfies the conditions of the theorem. More precisely, it is easy to deduce that $\Delta\gamma(t) = -2d(p-4)/\|t\|^4$ and thus the sufficient condition of the theorem is written as $0 \le d \le 4(p-4)/((k+4)(k+6))$, which only occurs when $p \ge 5$. Straightforward calculus shows that the optimal value of $d$ is given by $2(p-4)/((k+4)(k+6))$. The optimal constant in [11] is equal to $2(p-4)$. The extra terms in the denominator compensate for the $\|U\|^4$ term in our estimator.
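The estimators of this subsection depend on the data only through $X$ and the residual vector $U$, so they are straightforward to compute. The following is a minimal sketch, assuming NumPy, of $\delta_0(U) = p\|U\|^2/k$ and of the corrected estimator (4.15) with $\gamma(t) = d/\|t\|^2$; the function names are hypothetical and the default $d$ is the optimal constant quoted above.

```python
import numpy as np

def delta0_residual(u, p):
    """Unbiased loss estimate p * ||U||^2 / k for the estimator X."""
    return p * np.dot(u, u) / u.size

def optimal_d(p, k):
    """Optimal constant d = 2 (p - 4) / ((k + 4) (k + 6))."""
    return 2.0 * (p - 4) / ((k + 4) * (k + 6))

def improved_loss_estimate(x, u, d=None):
    """delta(X, U) = delta_0(U) - ||U||^4 * d / ||X||^2, as in (4.15)."""
    p, k = x.size, u.size
    if p < 5:
        raise ValueError("the domination result requires p >= 5")
    if d is None:
        d = optimal_d(p, k)
    return delta0_residual(u, p) - np.dot(u, u) ** 2 * d / np.dot(x, x)

if __name__ == "__main__":
    x = np.full(6, 2.0)
    u = np.ones(4)
    print(delta0_residual(u, x.size), improved_loss_estimate(x, u))
```

No distributional parameter enters these functions, which reflects the robustness property stressed in the text: the same code applies to every spherically symmetric law with a finite fourth moment.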

We now consider the estimation of the loss of a class of shrinkage estimators considered in [11] (with a slight modification of their form in order to have notations coherent with those of the previous sections), that is, location estimators of the form

$$\varphi_g = X + \|U\|^2 g(X), \tag{4.17}$$

where $g$ is a weakly differentiable function from $\mathbb{R}^p$ into $\mathbb{R}^p$. In [11] it is shown that, if $\|g\|^2 \le -2\operatorname{div}g/(k+2)$, then $\varphi_g$ dominates $X$, under quadratic loss for all spherically symmetric distributions with a finite second moment. A general example of a member of this class of estimators is with $g(X) = -r(\|X\|^2)\frac{A(X)}{b(X)}$, where $r$ is a positive differentiable and nondecreasing function, $A$ is a positive definite symmetric matrix and $b$ is a positive definite quadratic form of $\mathbb{R}^p$. When $r$ is equal to some constant $a$, $A$ is the identity on $\mathbb{R}^p$ and the quadratic form $b$ is the usual norm, $g(X)$ reduces to $-aX/\|X\|^2$. It can be shown that the optimal choice of $a$ equals $(p-2)/(k+2)$. A member of the class is $\varphi = X - \frac{p-2}{k+2}\frac{\|U\|^2}{\|X\|^2}X$, the James-Stein estimator used when the variance is unknown as in Section 3.

In Proposition 2.3.1 of Section 2.3 of [11], it is shown that an unbiased estimator of the loss of the shrinkage estimator $\varphi_g$ is given by

$$\delta_0^g(X,U) = \frac{p}{k}\|U\|^2 + \left(\|g(X)\|^2 + \frac{2}{k+2}\operatorname{div}g(X)\right)\|U\|^4. \tag{4.18}$$

As in Theorem 4.3 above, the unbiased estimator of the loss can be improved by a shrinkage estimator of the loss. Thus the competing estimator we consider is

$$\delta^g(X,U) = \delta_0^g(X,U) - \|U\|^4\gamma(X), \tag{4.19}$$

where $\gamma$ is a nonnegative function. Note that (4.19) is a true shrinkage estimator, while Johnstone's [32] optimal loss estimate for the normal case is an expanding estimator. This is not contradictory since we are using a different estimator than Johnstone and he was only dealing with the normal case. If $g \equiv 0$, the following result reduces to Theorem 4.3.

Theorem 4.4. Assume that $p \ge 5$, the distribution of $(X,U)$ has a finite fourth moment and the function $\gamma$ is twice weakly differentiable on $\mathbb{R}^p$ and there exists a constant $\beta$ such that $\gamma(t) \le \beta/\|t\|^2$. A sufficient condition under which the estimator $\delta^g$ given in (4.19) dominates the unbiased estimator $\delta_0^g$ is that $\gamma$ satisfies the differential inequality

$$\gamma^2 - \frac{4}{k+2}\gamma\operatorname{div}g + \frac{4}{k+6}\operatorname{div}(\gamma g) + \frac{2}{(k+4)(k+6)}\Delta\gamma \le 0. \tag{4.20}$$



Proof. Since the distribution of $(X,U)$ is spherically symmetric around $(\theta,0)$, it suffices to obtain the result working conditionally on the radius. For $R > 0$ fixed, we can compute using the uniform distribution $U_{R,\theta}$ on the sphere $S_{R,\theta}$. Thus the conditional risk difference between $\delta^g$ and $\delta_0^g$, according to (4.19), equals

$$E_{R,\theta}[(\delta^g(X,U) - \|\varphi_g(X,U)-\theta\|^2)^2] - E_{R,\theta}[(\delta_0^g(X,U) - \|\varphi_g(X,U)-\theta\|^2)^2] = E_{R,\theta}[\|U\|^8\gamma^2(X)] - E_{R,\theta}[2\|U\|^4\gamma(X)(\delta_0^g(X,U) - \|\varphi_g(X,U)-\theta\|^2)],$$

that is, expanding and separating the integrand terms depending on $\theta$,

$$E_{R,\theta}\left[\|U\|^8\gamma^2(X) - 2\frac{p}{k}\|U\|^6\gamma(X) - \frac{4}{k+2}\|U\|^8\gamma(X)\operatorname{div}g(X)\right] + E_{R,\theta}[4\|U\|^6(X-\theta)^t\gamma(X)g(X)] + E_{R,\theta}[2\|U\|^4\|X-\theta\|^2\gamma(X)],$$

according to (4.18) (note that the two terms involving $\|g(X)\|^2$ cancel). Now we have

$$E_{R,\theta}[4\|U\|^6(X-\theta)^t\gamma(X)g(X)] = E_{R,\theta}\left[\frac{4}{k+6}\|U\|^8\operatorname{div}(\gamma(X)g(X))\right]$$

according to Lemma A.2 and

$$E_{R,\theta}[2\|U\|^4\|X-\theta\|^2\gamma(X)] = E_{R,\theta}\left[\frac{2p}{k+4}\|U\|^6\gamma(X) + \frac{2}{(k+4)(k+6)}\|U\|^8\Delta\gamma(X)\right]$$

according to Corollary A.1. Therefore the above conditional risk difference is equal to

$$E_{R,\theta}\left[\|U\|^8\left\{\gamma^2(X) - \frac{4}{k+2}\gamma(X)\operatorname{div}g(X) + \frac{4}{k+6}\operatorname{div}(\gamma(X)g(X)) + \frac{2}{(k+4)(k+6)}\Delta\gamma(X)\right\}\right] + E_{R,\theta}\left[2p\left(\frac{1}{k+4} - \frac{1}{k}\right)\|U\|^6\gamma(X)\right],$$

which is bounded above by the first expectation since the function $\gamma$ is nonnegative. Hence, the sufficient condition for domination is (4.20) in order that the inequality $\mathcal{R}(\delta^g,\theta,\varphi_g) \le \mathcal{R}(\delta_0^g,\theta,\varphi_g)$ holds. $\square$

5. DISCUSSION 

There are several areas of the theory of loss es- 
timation that we have not discussed. Our primary 
focus has been on location parameters for the multi- 
variate normal and spherical distributions. Loss es- 
timation for exponential families is addressed in Lele 
[38, 39] and Rukhin [42]. In [38] and [39] Lele devel- 
oped improved loss estimators for point estimators 
in the general setup of Hudson's [31] subclass of con- 
tinuous exponential family. Hudson's family essen- 
tially includes distributions for which the Stein-like 
identities hold; explicit calculations and loss estima- 
tors are given for the gamma distribution, as well as 
for improved scaled quadratic loss estimators in the 
Poisson setting for the Clevenson-Zidek [13] estima- 
tor. Rukhin [42] studied the posterior loss estimator 
for a Bayes estimate (under quadratic loss) for the 
canonical parameter of a linear exponential family. 

As pointed out in the Introduction, in the known 
variance normal setting, Johnstone [32] used a ver- 
sion of Blyth's lemma to show that the constant loss 
estimate p is admissible if p < 4. Lele [39] gave some 
additional sufficient conditions for admissibility in 
the general exponential family and worked out the 
precise details for the Poisson model. Rukhin [42] considered loss functions for the simultaneous estimate of $\theta$ and $L(\theta,\varphi(X))$ and deduced some interesting admissibility results.

A number of researchers have investigated improved estimators of a covariance matrix, $\Sigma$, under the Stein loss, $L_S(\hat\Sigma,\Sigma) = \operatorname{tr}(\hat\Sigma\Sigma^{-1}) - \log|\hat\Sigma\Sigma^{-1}| - p$, using an unbiased estimation of risk technique. In the normal case, [15, 27, 45, 47] and [49] proposed improved estimators that dominate the sample covariance under $L_S(\hat\Sigma,\Sigma)$. In [36], it was shown that the domination of these improved estimators over the sample covariance estimator is robust with respect to the family of elliptical distributions. To date, there has not been any work on improving the unbiased estimate of $L_S(\hat\Sigma,\Sigma)$.

In addition to the theoretical ideas discussed in 
the previous sections there are very practical ap- 
plications of loss estimation. The primary applica- 
tion of loss estimation ideas is to model selection. 
It was shown by Fourdrinier and Wells [24] that 
improved loss estimators give more accurate model 
selection procedures. Bartlett, Boucheron and Lugosi [3]
studied model selection strategies based on
penalized empirical loss minimization and pointed
out the equivalence between loss estimation and
data-based complexity penalization. It was shown that
any good loss estimate may be converted into a
data-based penalty function and the performance of the
estimate is governed by the quality of the loss estimate.
Furthermore, a selected model that minimizes
the penalized empirical loss achieves an almost optimal
trade-off between the approximation error and
the expected complexity, provided that the loss estimate
on which the complexity is based is an approximate
upper bound on the true loss. The key point
to stress is that the notions of good complexity
regularization and good loss estimation depend
fundamentally on each other. The ideas in this review lay the
theoretical foundation for the construction of such 
loss estimators and model selection rules as well as 
give a decision-theoretic analysis of their statistical 
properties. 

In linear models the notion of degrees of freedom
plays an important role as a model complexity measure
in various model selection criteria, such as the Akaike
information criterion (AIC) [1], Mallows' $C_p$ [41],
the Bayesian information criterion (BIC) [44] and
generalized cross-validation (GCV) [14]. In regression
the degrees of freedom are the trace of the
so-called "hat" matrix. Efron [18] pointed out that the
theory of Stein's unbiased risk estimation is central 
to the ideas underlying the calculation of the degrees 
of freedom of certain regression estimators. 

Specifically, let $Y$ be a random vector having an
$n$-variate normal distribution $\mathcal{N}(\theta,\sigma^2 I_n)$ with unknown
$n$-dimensional mean $\theta$ and covariance matrix
$\sigma^2 I_n$. Let $\hat{\theta} = \varphi(Y)$ be an estimate of $\theta$. In
regression one focuses on how accurately $\varphi$ can
predict a new response vector $y^{\mathrm{new}}$. Under
the quadratic loss, the prediction risk is
$E\{\|y^{\mathrm{new}} - \varphi(Y)\|^2\}/n$. Efron [18] noted that

$$E\{\|\varphi - \theta\|^2\} = E\{\|Y - \varphi(Y)\|^2 - n\sigma^2\} + 2\sum_{i=1}^{n}\mathrm{Cov}(\varphi_i, Y_i). \tag{5.1}$$



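This expression suggests a natural definition of the
degrees of freedom for an estimator $\varphi$ as
$\mathrm{df}(\varphi) = \sum_{i=1}^{n}\mathrm{Cov}(\varphi_i, Y_i)/\sigma^2 = E_\theta[(Y-\theta)^t\varphi(Y)]/\sigma^2$.
Thus one can define a $C_p$-type quantity

$$C_p(\varphi) = \frac{\|Y - \varphi\|^2}{n} + \frac{2\,\mathrm{df}(\varphi)}{n}\,\sigma^2, \tag{5.2}$$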



which has the same expectation as the true prediction
error but may not be an estimate if $\mathrm{df}(\varphi)$
and $\sigma^2$ are unknown. However, if $\varphi$ is weakly
differentiable and $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, the
integration by parts formula in Lemma 3.1 implies
that $\mathrm{df}(\varphi)\sigma^2 = E_\theta[\mathrm{div}\,\varphi(Y)\,\hat{\sigma}^2]$, hence
$\mathrm{div}\,\varphi\,\hat{\sigma}^2$ is an unbiased estimate of the complexity
parameter term, $\mathrm{df}(\varphi)\sigma^2$, in (5.2). Therefore an
unbiased estimate for the prediction error is

$$\hat{C}_p(\varphi) = \frac{\|Y - \varphi\|^2}{n} + \frac{2\,\mathrm{div}\,\varphi}{n}\,\hat{\sigma}^2. \tag{5.3}$$
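As a rough numerical illustration of the identity $\mathrm{df}(\varphi)\sigma^2 = E_\theta[\mathrm{div}\,\varphi(Y)\,\hat{\sigma}^2]$, the following Python sketch compares Monte Carlo averages of $(Y-\theta)^t\varphi(Y)$ and of $\mathrm{div}\,\varphi(Y)$. The use of coordinatewise soft thresholding as the estimator, the known variance $\sigma^2 = 1$, the mean vector and all function names are assumptions made for this example only.

```python
import random

def soft_threshold(y, lam):
    """Coordinatewise soft thresholding, a weakly differentiable shrinkage rule."""
    return [(abs(v) - lam) * (1.0 if v > 0 else -1.0) if abs(v) > lam else 0.0
            for v in y]

def divergence(y, lam):
    """Weak divergence of soft thresholding: number of coordinates left nonzero."""
    return sum(1 for v in y if abs(v) > lam)

def df_two_ways(theta, lam, n_draws=20000, seed=0):
    """Monte Carlo estimates of E[(Y - theta)^t phi(Y)] and E[div phi(Y)],
    assuming sigma^2 = 1 (an assumption of this sketch)."""
    rng = random.Random(seed)
    cov_sum = div_sum = 0.0
    for _ in range(n_draws):
        y = [t + rng.gauss(0.0, 1.0) for t in theta]
        phi = soft_threshold(y, lam)
        cov_sum += sum((yi - ti) * fi for yi, ti, fi in zip(y, theta, phi))
        div_sum += divergence(y, lam)
    return cov_sum / n_draws, div_sum / n_draws

if __name__ == "__main__":
    theta = [0.0] * 5 + [3.0] * 5        # hypothetical sparse mean vector
    print(df_two_ways(theta, lam=1.0))   # the two averages should be close
```

The two averages agree only up to Monte Carlo error, so the comparison is a sanity check rather than a proof.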



Note that, if $\varphi$ is a linear estimator ($\varphi = SY$ for
some matrix $S$ independent of $Y$), then it is easy
to show that this definition coincides with the
definition of generalized degrees of freedom given by
Hastie and Tibshirani [28] since $\mathrm{div}\,\varphi = \mathrm{tr}(S)$. Note
that, if $\varphi$ also depends on $\hat{\sigma}^2$, then (5.1) needs to
be augmented by additional derivative terms with
respect to $\hat{\sigma}^2$ as in Theorem 3.1.
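For a linear estimator the divergence is simply the trace of the smoothing matrix, so the estimate (5.3) can be computed directly. The sketch below does this in Python for a moving-average smoother; the smoother, the assumption that $\sigma^2$ is known (so that it is plugged in for $\hat{\sigma}^2$) and the function names are illustrative choices, not taken from the text.

```python
def moving_average_matrix(n, half_width):
    """Rows of a hypothetical linear smoother S: each fitted value is the
    mean of the observations within `half_width` positions of index i."""
    S = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n - 1, i + half_width)
        weight = 1.0 / (hi - lo + 1)
        S.append([weight if lo <= j <= hi else 0.0 for j in range(n)])
    return S

def cp_linear(y, S, sigma2):
    """C_p-type estimate (5.3) for phi(Y) = S Y, using div phi = tr(S).
    sigma2 is treated as known here (an assumption of this sketch)."""
    n = len(y)
    fitted = [sum(S[i][j] * y[j] for j in range(n)) for i in range(n)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    df = sum(S[i][i] for i in range(n))   # trace of the "hat" matrix
    return rss / n + 2.0 * df * sigma2 / n, df

if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(cp_linear(y, moving_average_matrix(5, 1), sigma2=1.0))
```

Widening the window lowers the trace (fewer degrees of freedom) at the cost of a larger residual sum of squares, which is the trade-off the penalty term in (5.3) is meant to balance.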

Other approaches for estimating the complexity 
term penalty involve the use of resampling meth- 
ods [18, 52] to directly estimate the prediction er- 
ror. A K-fold cross-validation randomly divides the 
original sample into K parts, and rotates through 
each part as a test sample and uses the remainder 
as a training sample. Cross-validation provides an 
approximately unbiased estimate of the prediction 
error, although its variance can be large. Other com- 
monly used resampling techniques are the nonpara- 
metric and parametric bootstrap methods. 
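The K-fold procedure described above can be sketched in a few lines of Python. The simple linear regression used as the fitted model, the squared-error criterion and all function names are assumptions of this example; any fitting routine returning a prediction function could be substituted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k disjoint test folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least squares fit of y = a + b x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_prediction_error(xs, ys, k, fit=fit_line, seed=0):
    """K-fold cross-validation estimate of the mean squared prediction error."""
    n = len(xs)
    total = 0.0
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        # Each part serves once as the test sample.
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test)
    return total / n

if __name__ == "__main__":
    rng = random.Random(1)
    xs = [float(i) for i in range(30)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    print(cv_prediction_error(xs, ys, k=5))
```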

A number of new regularized regression methods
have recently been developed, starting with Ridge
regression [29], followed by the Lasso [50], the Elastic
Net [53] and LARS [19]. Each of these estimates
is weakly differentiable and has the form of a general
shrinkage estimate; thus the prediction error estimate
in (5.3) may be applied to construct a model
selection procedure. Zou, Hastie and Tibshirani [54]
used this idea to develop a model selection method
for the Lasso. In some situations verifying the weak
differentiability of $\varphi$ may be complicated.
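To make the model selection use of (5.3) concrete, the following Python sketch treats the special case of an orthonormal design, where the Lasso reduces to coordinatewise soft thresholding and the divergence is the number of nonzero coefficients. The orthonormal-design simplification, the known variance, the candidate grid and the function names are assumptions of this example; it is not the procedure of [54] itself.

```python
def soft_threshold(y, lam):
    """Soft thresholding: the Lasso solution under an orthonormal design."""
    return [(abs(v) - lam) * (1.0 if v > 0 else -1.0) if abs(v) > lam else 0.0
            for v in y]

def cp_soft_threshold(y, lam, sigma2):
    """Estimate (5.3) with div phi = number of coordinates kept nonzero.
    sigma2 is treated as known (an assumption of this sketch)."""
    n = len(y)
    fitted = soft_threshold(y, lam)
    rss = sum((a - b) ** 2 for a, b in zip(y, fitted))
    div = sum(1 for v in y if abs(v) > lam)
    return rss / n + 2.0 * div * sigma2 / n

def select_lambda(y, grid, sigma2):
    """Pick the threshold in `grid` minimizing the estimated prediction error."""
    return min(grid, key=lambda lam: cp_soft_threshold(y, lam, sigma2))

if __name__ == "__main__":
    y = [3.0, -0.5, 0.2, -4.0]
    print(select_lambda(y, [0.0, 1.0, 5.0], sigma2=1.0))
```

With no thresholding the residual term vanishes but every coordinate is charged a degree of freedom; with heavy thresholding the reverse holds, and the criterion selects an intermediate value.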

Loss estimates have been used to derive nonpara- 
metric penalized empirical loss estimates in the con- 
text of function estimation, which adapt to the un- 
known smoothness of the function of interest. See 
Barron et al. [2] and Donoho and Johnstone [16] for 
more details. 

In the previous sections, the usual quadratic loss
$L(\theta,\varphi(x)) = \|\varphi(x) - \theta\|^2$ was considered to
evaluate various estimators $\varphi(X)$ of $\theta$. The squared norm
$\|x - \theta\|^2$ was crucial in the derivation of the
properties of the loss estimators in conjunction with its
role in the normal density or, more generally, in
a spherical density. Other losses are conceivable but,
to keep the calculations tractable, it is important to
retain the Euclidean norm as a component of the
loss in use. Hence a natural extension is to consider
losses which are a function of $\|x - \theta\|^2$, that
is, of the form $c(\|x - \theta\|^2)$ for a nonnegative
function $c$ defined on $\mathbb{R}_+$. The problem of estimating
a function $c$ of $\|x - \theta\|^2$ was tackled by Fourdrinier
and Lepelletier [21] to which we refer the reader
for more details. In particular, they focused on the
fact that estimating $c(\|x - \theta\|^2)$ can be viewed as
an evaluation of a quantity which is not necessarily
a loss. Indeed it includes the problem of estimating
the confidence statement of the usual confidence set
$\{\theta \in \mathbb{R}^p \mid \|x - \theta\|^2 \le c_\alpha\}$ with confidence coefficient
$1 - \alpha$: $c$ is the indicator function $1_{[0,c_\alpha]}$.

APPENDIX

A.1 Risk Finiteness Conditions

Lemma A.1. 1. Let $X \sim \mathcal{N}(\theta, I_p)$, where $\theta$ is
unknown, and denote by $E_\theta$ the expectation with
respect to the distribution of $X$. Consider an estimator
of $\theta$ of the form $\varphi(X) = X + g(X)$ where $g$ is
a function from $\mathbb{R}^p$ into $\mathbb{R}^p$.

a. If $g$ is such that $E_\theta[\|g(X)\|^2] < \infty$, then the
quadratic risk of $\varphi(X)$, that is,
$R(\theta,\varphi) = E_\theta[\|\varphi(X) - \theta\|^2]$, is finite.

b. If, in addition, the function $g$ is weakly differentiable
so that $\delta_0(X) = p + 2\,\mathrm{div}\,g(X) + \|g(X)\|^2$ is an
unbiased estimator of the loss $\|\varphi(X) - \theta\|^2$, then the
risk of $\delta_0(X)$ defined by
$\mathcal{R}(\theta,\varphi,\delta_0) = E_\theta[(\delta_0(X) - \|\varphi(X) - \theta\|^2)^2]$
is finite as soon as $E_\theta[\|g(X)\|^4] < \infty$
and $E_\theta[(\mathrm{div}\,g(X))^2] < \infty$.

2. Let $X \sim \mathcal{N}(\theta,\sigma^2 I_p)$, where $\theta$ and $\sigma^2$ are
unknown, let $S$ be a nonnegative random variable
independent of $X$ and such that $S \sim \sigma^2\chi^2_n$ and denote
by $E_{\theta,\sigma^2}$ the expectation with respect to the joint
distribution of $(X,S)$. Consider an estimator of $\theta$ of the
form $\varphi(X,S) = X + S\,g(X,S)$ where $g$ is a function
from $\mathbb{R}^p \times \mathbb{R}_+$ into $\mathbb{R}^p$.

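a. If $g$ is such that $E_{\theta,\sigma^2}[S^2\|g(X,S)\|^2] < \infty$, then
the quadratic risk of $\varphi(X,S)$, that is,
$R(\theta,\sigma^2,\varphi) = E_{\theta,\sigma^2}[\|\varphi(X,S) - \theta\|^2/\sigma^2]$, is finite.

b. If, in addition, the function $g$ is weakly differentiable
so that

$$\delta_0(X,S) = p + S\Big\{(n+2)\|g(X,S)\|^2 + 2\,\mathrm{div}_X\,g(X,S) + 2S\frac{\partial}{\partial S}\|g(X,S)\|^2\Big\}$$

is an unbiased estimator of the loss
$\|\varphi(X,S) - \theta\|^2/\sigma^2$, then the risk of $\delta_0(X,S)$ defined by
$\mathcal{R}(\theta,\sigma^2,\varphi,\delta_0) = E_{\theta,\sigma^2}[(\delta_0(X,S) - \|\varphi(X,S) - \theta\|^2/\sigma^2)^2]$
is finite as soon as $E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4] < \infty$,
$E_{\theta,\sigma^2}[(S\,\mathrm{div}\,g(X,S))^2] < \infty$ and
$E_{\theta,\sigma^2}[(S^2\frac{\partial}{\partial S}\|g(X,S)\|)^2] < \infty$.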



Proof. 1.a. The loss of $\varphi(X)$ can be expanded as

$$\|\varphi(X) - \theta\|^2 = \|X - \theta\|^2 + \|g(X)\|^2 + 2(X - \theta)^t g(X). \tag{A.4}$$

Now we have $E_\theta[\|X - \theta\|^2] = p < \infty$. Hence, by
Schwarz's inequality, it follows from (A.4) that
$|E_\theta[(X - \theta)^t g(X)]| \le (E_\theta[\|X - \theta\|^2])^{1/2}\,(E_\theta[\|g(X)\|^2])^{1/2}$.
Therefore, as soon as $E_\theta[\|g(X)\|^2] < \infty$, we will have
$E_\theta[\|\varphi(X) - \theta\|^2] < \infty$. This is the desired result.

b. Note that, under the usual domination condition
of $\varphi(X)$ over $X$, that is,
$2\,\mathrm{div}\,g(x) + \|g(x)\|^2 \le 0$ for any $x \in \mathbb{R}^p$,
the condition $E_\theta[(\mathrm{div}\,g(X))^2] < \infty$
implies that $E_\theta[\|g(X)\|^4] < \infty$. We will have
$\mathcal{R}(\theta,\varphi,\delta_0) = E_\theta[(\delta_0(X) - \|\varphi(X) - \theta\|^2)^2] < \infty$ as soon as
$E_\theta[\delta_0^2(X)] < \infty$ and $E_\theta[\|\varphi(X) - \theta\|^4] < \infty$. Now
$E_\theta[\delta_0^2(X)] = E_\theta[(p + 2\,\mathrm{div}\,g(X) + \|g(X)\|^2)^2] < \infty$ since
$E_\theta[(\mathrm{div}\,g(X))^2] < \infty$ and $E_\theta[\|g(X)\|^4] < \infty$. Also
according to (A.4)

$$E_\theta[\|\varphi(X) - \theta\|^4] = E_\theta[(\|X - \theta\|^2 + \|g(X)\|^2 + 2(X - \theta)^t g(X))^2] < \infty$$

since $E_\theta[\|X - \theta\|^4] < \infty$ and $E_\theta[\|g(X)\|^4] < \infty$ and,
consequently, since $|(X - \theta)^t g(X)| \le \|X - \theta\|\,\|g(X)\|$
implies that

$$E_\theta[|(X - \theta)^t g(X)|^2] \le E_\theta[\|X - \theta\|^2\|g(X)\|^2] \le (E_\theta[\|X - \theta\|^4])^{1/2}\,(E_\theta[\|g(X)\|^4])^{1/2}$$

by Schwarz's inequality.

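2.a. Parallel to the case where the variance $\sigma^2$ is
known, it should be noticed that the corresponding
domination condition of $\varphi(X,S)$ over $X$, that
is, for any $x \in \mathbb{R}^p$ and any $s \in \mathbb{R}_+$,
$(n+2)\|g(x,s)\|^2 + 2\,\mathrm{div}_x\,g(x,s) + 2s\frac{\partial}{\partial s}\|g(x,s)\|^2 \le 0$,
entails that the two conditions
$E_{\theta,\sigma^2}[(S\,\mathrm{div}\,g(X,S))^2] < \infty$ and
$E_{\theta,\sigma^2}[(S^2\frac{\partial}{\partial S}\|g(X,S)\|)^2] < \infty$ imply the condition
$E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4] < \infty$. Also the derivation of the
finiteness of $R(\theta,\sigma^2,\varphi)$ follows a similar way as in 1.a.

b. We will have
$\mathcal{R}(\theta,\sigma^2,\varphi,\delta_0) = E_{\theta,\sigma^2}[(\delta_0(X,S) - \|\varphi(X,S) - \theta\|^2/\sigma^2)^2] < \infty$
as soon as $E_{\theta,\sigma^2}[(\delta_0(X,S))^2] < \infty$ and
$E_{\theta,\sigma^2}[\|\varphi(X,S) - \theta\|^4] < \infty$. Now
$E_{\theta,\sigma^2}[(\delta_0(X,S))^2] = E_{\theta,\sigma^2}[(p + S\{(n+2)\|g(X,S)\|^2 + 2\,\mathrm{div}_X\,g(X,S) + 2S\frac{\partial}{\partial S}\|g(X,S)\|^2\})^2] < \infty$
since we assume that $E_{\theta,\sigma^2}[(S\,\mathrm{div}_X\,g(X,S))^2] < \infty$ and
$E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4] < \infty$. Also
$E_{\theta,\sigma^2}[\|\varphi(X,S) - \theta\|^4] = E_{\theta,\sigma^2}[(\|X - \theta\|^2 + S^2\|g(X,S)\|^2 + 2S(X - \theta)^t g(X,S))^2] < \infty$
since $E_\theta[\|X - \theta\|^4] < \infty$ and
$E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4] < \infty$ (note that
$|(X - \theta)^t g(X,S)| \le \|X - \theta\|\,\|g(X,S)\|$ implies that

$$E_{\theta,\sigma^2}[|(X - \theta)^t S g(X,S)|^2] \le E_{\theta,\sigma^2}[\|X - \theta\|^2 S^2\|g(X,S)\|^2] \le (E_{\theta,\sigma^2}[\|X - \theta\|^4])^{1/2}\,(E_{\theta,\sigma^2}[S^2\|g(X,S)\|^4])^{1/2}$$

by Schwarz's inequality). □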
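The known-variance part of Lemma A.1 can be illustrated numerically. The Python sketch below takes the James-Stein choice $g(x) = -(p-2)x/\|x\|^2$, for which $\mathrm{div}\,g(x) = -(p-2)^2/\|x\|^2$ and hence $\delta_0(x) = p - (p-2)^2/\|x\|^2$, and compares the Monte Carlo averages of the loss and of $\delta_0$. The choice of $g$, the dimension $p = 10$ (so that $E_\theta[\|g(X)\|^4]$ is finite, which requires $p \ge 5$), the value of $\theta$ and the function names are assumptions of this example.

```python
import random

def james_stein_loss_and_estimate(x, theta):
    """Return (loss, delta0) for phi(x) = x + g(x), g(x) = -(p - 2) x / ||x||^2."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    shrink = (p - 2) / norm2
    phi = [v - shrink * v for v in x]
    loss = sum((a - b) ** 2 for a, b in zip(phi, theta))  # unobservable in practice
    delta0 = p - (p - 2) ** 2 / norm2                     # p + 2 div g + ||g||^2
    return loss, delta0

def monte_carlo_check(theta, n_draws=20000, seed=1):
    """Average the loss and its unbiased estimator over draws X ~ N(theta, I_p)."""
    rng = random.Random(seed)
    mean_loss = mean_delta0 = 0.0
    for _ in range(n_draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        loss, delta0 = james_stein_loss_and_estimate(x, theta)
        mean_loss += loss / n_draws
        mean_delta0 += delta0 / n_draws
    return mean_loss, mean_delta0

if __name__ == "__main__":
    print(monte_carlo_check([1.0] * 10))  # the two averages should be close
```

The agreement of the two averages reflects unbiasedness only; the lemma's additional moment conditions are what guarantee that the squared discrepancy itself has a finite expectation.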

A.2 Additional Technical Lemmas

This Appendix gives some technical results used
in Section 4.2. The first two results deal with expectations
conditioned on the radius of a spherically
symmetric distribution in $\mathbb{R}^p \times \mathbb{R}^k$ centered at $(\theta,0)$
where $\theta \in \mathbb{R}^p$. These expectations reduce to integrals
with respect to the uniform distribution $U_{R,\theta}$ on the
sphere

$$S_{R,\theta} = \{y = (x,u) \in \mathbb{R}^p \times \mathbb{R}^k \mid (\|x - \theta\|^2 + \|u\|^2)^{1/2} = R\}.$$

If $E_{R,\theta}[\psi]$ is the expectation of some function $\psi$ with
respect to $U_{R,\theta}$, the expectation with respect to the
entire distribution is given by $E_\theta[\psi] = E[E_{R,\theta}[\psi]]$
where $E$ is the expectation with respect to the
distribution of the radius.

When the spherical distribution has a density with
respect to the Lebesgue measure, it is necessarily of
the form $f(\|x - \theta\|^2 + \|u\|^2)$ for some function $f$.
Then the radius has density $R \mapsto \sigma_{p+k} f(R^2) R^{p+k-1}$
where $\sigma_{p+k} = 2\pi^{(p+k)/2}/\Gamma((p+k)/2)$. Therefore the
expectation of any function $\psi$ can be written as

$$E_\theta[\psi] = \int_0^\infty \int_{S_{R,\theta}} \psi(y)\,U_{R,\theta}(dy)\,\sigma_{p+k} R^{p+k-1} f(R^2)\,dR.$$



Note that for a vector $y = (x,u) \in S_{R,\theta}$, we have
$x = \pi(y)$ and $\|u\|^2 = R^2 - \|\pi(y) - \theta\|^2$ where $\pi$ is
the orthogonal projector from $\mathbb{R}^p \times \mathbb{R}^k$ onto $\mathbb{R}^p$.
Under $U_{R,\theta}$, the distribution $\pi(U_{R,\theta})$ of this projector
has a density with respect to the Lebesgue measure
on $\mathbb{R}^p$ given by
$x \mapsto C_R^{p,k}(R^2 - \|x - \theta\|^2)^{k/2-1} 1_{B_{R,\theta}}(x)$
where $C_R^{p,k} = \Gamma((p+k)/2)\,R^{2-p-k}/(\Gamma(k/2)\,\pi^{p/2})$ and $1_{B_{R,\theta}}$
is the indicator function of the ball
$B_{R,\theta} = \{x \in \mathbb{R}^p \mid \|x - \theta\| \le R\}$ of radius $R$ centered at $\theta$ in $\mathbb{R}^p$.

According to the above, as a spherically symmetric
distribution on $\mathbb{R}^p$ around $\theta$, the radius of $\pi(U_{R,\theta})$
has density

$$r \mapsto \sigma_p C_R^{p,k}(R^2 - r^2)^{k/2-1} r^{p-1} 1_{]0,R[}(r) = \frac{2R^{2-p-k}}{B(p/2,k/2)}\,r^{p-1}(R^2 - r^2)^{k/2-1} 1_{]0,R[}(r).$$

We use repeatedly the fact that any such projection
onto a space of dimension greater than 0 and less
than $p + k$ is spherically symmetric with a density.
Then we also often make use of its radial density.
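The radial density above can be checked numerically. The Python sketch below integrates it to confirm that it has total mass one and second moment $R^2 p/(p+k)$, and compares the latter with a Monte Carlo average obtained by normalizing Gaussian vectors to the sphere. The values $p = 3$, $k = 4$, $R = 2$ (chosen so that the exponent $k/2 - 1$ is positive and the integrand is smooth), the midpoint rule and the function names are assumptions of this example.

```python
import math
import random

def radial_density(r, R, p, k):
    """Density of ||X - theta|| under the projection of U_{R,theta} onto R^p."""
    if not 0.0 < r < R:
        return 0.0
    beta = math.gamma(p / 2) * math.gamma(k / 2) / math.gamma((p + k) / 2)
    return (2.0 * R ** (2 - p - k) / beta
            * r ** (p - 1) * (R * R - r * r) ** (k / 2 - 1))

def integrate(f, a, b, steps=20000):
    """Midpoint rule, adequate for the smooth integrands used here."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def sample_projection_radius(R, p, k, rng):
    """Radius of the first p coordinates of a uniform point on the sphere
    of radius R in R^(p+k), obtained by normalizing a Gaussian vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(p + k)]
    norm = math.sqrt(sum(v * v for v in z))
    return R * math.sqrt(sum(v * v for v in z[:p])) / norm

if __name__ == "__main__":
    R, p, k = 2.0, 3, 4
    rng = random.Random(0)
    mc = sum(sample_projection_radius(R, p, k, rng) ** 2 for _ in range(20000)) / 20000
    exact = integrate(lambda r: r * r * radial_density(r, R, p, k), 0.0, R)
    print(mc, exact, R * R * p / (p + k))
```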

Lemma A.2. For every twice weakly differentiable
function $g\,(\mathbb{R}^p \to \mathbb{R}^p)$ and for every function $h$,

$$E_{R,\theta}[h(\|U\|^2)(X - \theta)^t g(X)] = E_{R,\theta}\left[\frac{H(\|U\|^2)}{(\|U\|^2)^{k/2-1}}\,\mathrm{div}\,g(X)\right] \tag{A.5}$$

where $H$ is the indefinite integral, vanishing at 0, of
the function $t \mapsto \frac{1}{2}h(t)\,t^{k/2-1}$.

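Proof. We have

$$E_{R,\theta}[h(\|U\|^2)(X - \theta)^t g(X)] = C_R^{p,k}\int_{B_{R,\theta}} h(R^2 - \|x - \theta\|^2)\,(x - \theta)^t g(x)\,(R^2 - \|x - \theta\|^2)^{k/2-1}\,dx$$

$$= -C_R^{p,k}\int_{B_{R,\theta}} (\nabla H(R^2 - \|x - \theta\|^2))^t g(x)\,dx$$

since

$$\nabla H(R^2 - \|x - \theta\|^2) = -2H'(R^2 - \|x - \theta\|^2)(x - \theta) = -h(R^2 - \|x - \theta\|^2)(R^2 - \|x - \theta\|^2)^{k/2-1}(x - \theta).$$

Then, by the divergence formula,

$$E_{R,\theta}[h(\|U\|^2)(X - \theta)^t g(X)] = -C_R^{p,k}\int_{B_{R,\theta}} \mathrm{div}(H(R^2 - \|x - \theta\|^2)\,g(x))\,dx + C_R^{p,k}\int_{B_{R,\theta}} H(R^2 - \|x - \theta\|^2)\,\mathrm{div}\,g(x)\,dx.$$

Now, if $\sigma_{R,\theta}$ denotes the area measure on the sphere
$S_{R,\theta}$, the divergence theorem ensures that the first
integral equals

$$-C_R^{p,k}\int_{S_{R,\theta}} (H(R^2 - \|x - \theta\|^2)\,g(x))^t\,\frac{x - \theta}{\|x - \theta\|}\,\sigma_{R,\theta}(dx)$$

and is null since, for $x \in S_{R,\theta}$, $R^2 - \|x - \theta\|^2 = 0$ and
$H(0) = 0$. Hence, in terms of expectation, we have

$$E_{R,\theta}[h(\|U\|^2)(X - \theta)^t g(X)] = C_R^{p,k}\int_{B_{R,\theta}} \frac{H(R^2 - \|x - \theta\|^2)}{(R^2 - \|x - \theta\|^2)^{k/2-1}}\,\mathrm{div}\,g(x)\,(R^2 - \|x - \theta\|^2)^{k/2-1}\,dx = E_{R,\theta}\left[\frac{H(\|U\|^2)}{(\|U\|^2)^{k/2-1}}\,\mathrm{div}\,g(X)\right]$$

which is the desired result. □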

Corollary A.1. For every twice weakly differentiable
function $\gamma\,(\mathbb{R}^p \to \mathbb{R}_+)$ and for every integer $q$,

$$E_{R,\theta}[\|U\|^q\|X - \theta\|^2\gamma(X)] = \frac{p}{k+q}\,E_{R,\theta}[\|U\|^{q+2}\gamma(X)] + \frac{1}{(k+q)(k+q+2)}\,E_{R,\theta}[\|U\|^{q+4}\Delta\gamma(X)].$$

Proof. Take $h(t) = t^{q/2}$ and $g(x) = \gamma(x)(x - \theta)$
and apply Lemma A.2 twice. □
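As a plausibility check of Corollary A.1, the following Python sketch estimates both sides of the identity by Monte Carlo for the particular choice $\gamma(x) = \|x\|^2$, whose Laplacian is the constant $2p$. The choice of $\gamma$, the values of $\theta$, $k$, $R$ and $q$, and the function names are assumptions of this example.

```python
import math
import random

def sample_on_sphere(theta, k, R, rng):
    """Draw (x, u) uniformly on the sphere of radius R centered at (theta, 0)."""
    p = len(theta)
    z = [rng.gauss(0.0, 1.0) for _ in range(p + k)]
    scale = R / math.sqrt(sum(v * v for v in z))
    x = [t + scale * v for t, v in zip(theta, z[:p])]
    u = [scale * v for v in z[p:]]
    return x, u

def corollary_sides(theta, k, R, q, n_draws=40000, seed=0):
    """Monte Carlo estimates of both sides of Corollary A.1 for gamma(x) = ||x||^2."""
    rng = random.Random(seed)
    p = len(theta)
    laplacian = 2.0 * p                      # Laplacian of ||x||^2
    lhs = rhs = 0.0
    for _ in range(n_draws):
        x, u = sample_on_sphere(theta, k, R, rng)
        gamma = sum(v * v for v in x)
        r2 = sum((a - b) ** 2 for a, b in zip(x, theta))
        u_norm = math.sqrt(sum(v * v for v in u))
        lhs += u_norm ** q * r2 * gamma
        rhs += (p / (k + q)) * u_norm ** (q + 2) * gamma
        rhs += u_norm ** (q + 4) * laplacian / ((k + q) * (k + q + 2))
    return lhs / n_draws, rhs / n_draws

if __name__ == "__main__":
    print(corollary_sides([1.0, 0.0, 0.0], k=4, R=2.0, q=1))
```

Both sides are computed from the same draws, so their difference is typically much smaller than the Monte Carlo error of either side taken alone.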